
INTRODUCTION TO DATA MINING

INTRODUCTION TO DATA MINING
SECOND EDITION

PANG-NING TAN, Michigan State University
MICHAEL STEINBACH, University of Minnesota
ANUJ KARPATNE, University of Minnesota
VIPIN KUMAR, University of Minnesota

330 Hudson Street, NY NY 10013

Director, Portfolio Management: Engineering, Computer Science & Global Editions: Julian Partridge

Specialist, Higher Ed Portfolio Management: Matt Goldstein

Portfolio Management Assistant: Meghan Jacoby

Managing Content Producer: Scott Disanno

Content Producer: Carole Snyder

Web Developer: Steve Wright

Rights and Permissions Manager: Ben Ferrini

Manufacturing Buyer, Higher Ed, LakeSide Communications Inc (LSC): Maura Zaldivar-Garcia

Inventory Manager: Ann Lam

Product Marketing Manager: Yvonne Vannatta

Field Marketing Manager: Demetrius Hall

Marketing Assistant: Jon Bryant

Cover Designer: Joyce Wells, jWells Design

Full-Service Project Management: Chandrasekar Subramanian, SPi Global

Copyright © 2019 Pearson Education, Inc. All rights reserved. Manufactured in the United States of America. This publication is protected by Copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions department, please visit www.pearsonhighed.com/permissions/.

Many of the designations by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Library of Congress Cataloging-in-Publication Data on File

Names: Tan, Pang-Ning, author. | Steinbach, Michael, author. | Karpatne, Anuj, author. | Kumar, Vipin, 1956- author.

Title: Introduction to Data Mining / Pang-Ning Tan, Michigan State University, Michael Steinbach, University of Minnesota, Anuj Karpatne, University of Minnesota, Vipin Kumar, University of Minnesota.

Description: Second edition. | New York, NY : Pearson Education, [2019] | Includes bibliographical references and index.

Identifiers: LCCN 2017048641 | ISBN 9780133128901 | ISBN 0133128903

Subjects: LCSH: Data mining.

Classification: LCC QA76.9.D343 T35 2019 | DDC 006.3/12–dc23 LC record available at https://lccn.loc.gov/2017048641


ISBN-10: 0133128903

ISBN-13: 9780133128901

To our families …

Preface to the Second Edition

Since the first edition, roughly 12 years ago, much has changed in the field of data analysis. The volume and variety of data being collected continues to increase, as has the rate (velocity) at which it is being collected and used to make decisions. Indeed, the term Big Data has been used to refer to the massive and diverse data sets now available. In addition, the term data science has been coined to describe an emerging area that applies tools and techniques from various fields, such as data mining, machine learning, statistics, and many others, to extract actionable insights from data, often big data.

The growth in data has created numerous opportunities for all areas of data analysis. The most dramatic developments have been in the area of predictive modeling, across a wide range of application domains. For instance, recent advances in neural networks, known as deep learning, have shown impressive results in a number of challenging areas, such as image classification, speech recognition, and text categorization and understanding. While not as dramatic, other areas, e.g., clustering, association analysis, and anomaly detection, have also continued to advance. This new edition is in response to those advances.

Overview

As with the first edition, the second edition of the book provides a comprehensive introduction to data mining and is designed to be accessible and useful to students, instructors, researchers, and professionals. Areas covered include data preprocessing, predictive modeling, association analysis, cluster analysis, anomaly detection, and avoiding false discoveries. The goal is to present fundamental concepts and algorithms for each topic, thus providing the reader with the necessary background for the application of data mining to real problems. As before, classification, association analysis, and cluster analysis are each covered in a pair of chapters. The introductory chapter covers basic concepts, representative algorithms, and evaluation techniques, while the following chapter discusses more advanced concepts and algorithms. As before, our objective is to provide the reader with a sound understanding of the foundations of data mining, while still covering many important advanced topics. Because of this approach, the book is useful both as a learning tool and as a reference.

To help readers better understand the concepts that have been presented, we provide an extensive set of examples, figures, and exercises. The solutions to the original exercises, which are already circulating on the web, will be made public. The exercises are mostly unchanged from the last edition, with the exception of new exercises in the chapter on avoiding false discoveries. New exercises for the other chapters and their solutions will be available to instructors via the web. Bibliographic notes are included at the end of each chapter for readers who are interested in more advanced topics, historically important papers, and recent trends. These have also been significantly updated. The book also contains a comprehensive subject and author index.

What is New in the Second Edition?

Some of the most significant improvements in the text have been in the two chapters on classification. The introductory chapter uses the decision tree classifier for illustration, but the discussion on many topics—those that apply across all classification approaches—has been greatly expanded and clarified, including topics such as overfitting, underfitting, the impact of training size, model complexity, model selection, and common pitfalls in model evaluation. Almost every section of the advanced classification chapter has been significantly updated. The material on Bayesian networks, support vector machines, and artificial neural networks has been significantly expanded. We have added a separate section on deep networks to address the current developments in this area. The discussion of evaluation, which occurs in the section on imbalanced classes, has also been updated and improved.

The changes in association analysis are more localized. We have completely reworked the section on the evaluation of association patterns (introductory chapter), as well as the sections on sequence and graph mining (advanced chapter). Changes to cluster analysis are also localized. The introductory chapter added the K-means initialization technique and an updated discussion of cluster evaluation. The advanced clustering chapter adds a new section on spectral graph clustering. Anomaly detection has been greatly revised and expanded. Existing approaches—statistical, nearest neighbor/density-based, and clustering-based—have been retained and updated, while new approaches have been added: reconstruction-based, one-class classification, and information-theoretic. The reconstruction-based approach is illustrated using autoencoder networks that are part of the deep learning paradigm. The data chapter has been updated to include discussions of mutual information and kernel-based techniques.

The last chapter, which discusses how to avoid false discoveries and produce valid results, is completely new, and is novel among other contemporary textbooks on data mining. It supplements the discussions in the other chapters with a discussion of the statistical concepts (statistical significance, p-values, false discovery rate, permutation testing, etc.) relevant to avoiding spurious results, and then illustrates these concepts in the context of data mining techniques. This chapter addresses the increasing concern over the validity and reproducibility of results obtained from data analysis. The addition of this last chapter is a recognition of the importance of this topic and an acknowledgment that a deeper understanding of this area is needed for those analyzing data.

The data exploration chapter has been deleted, as have the appendices, from the print edition of the book, but will remain available on the web. A new appendix provides a brief discussion of scalability in the context of big data.

To the Instructor

As a textbook, this book is suitable for a wide range of students at the advanced undergraduate or graduate level. Since students come to this subject with diverse backgrounds that may not include extensive knowledge of statistics or databases, our book requires minimal prerequisites. No database knowledge is needed, and we assume only a modest background in statistics or mathematics, although such a background will make for easier going in some sections. As before, the book, and more specifically, the chapters covering major data mining topics, are designed to be as self-contained as possible. Thus, the order in which topics can be covered is quite flexible. The core material is covered in Chapters 2 (data), 3 (classification), 5 (association analysis), 7 (clustering), and 9 (anomaly detection). We recommend at least a cursory coverage of Chapter 10 (Avoiding False Discoveries) to instill in students some caution when interpreting the results of their data analysis. Although the introductory data chapter (2) should be covered first, the basic classification (3), association analysis (5), and clustering chapters (7) can be covered in any order. Because of the relationship of anomaly detection (9) to classification (3) and clustering (7), these chapters should precede Chapter 9.

Various topics can be selected from the advanced classification, association analysis, and clustering chapters (4, 6, and 8, respectively) to fit the schedule and interests of the instructor and students. We also advise that the lectures be augmented by projects or practical exercises in data mining. Although they are time consuming, such hands-on assignments greatly enhance the value of the course.

Support Materials

Support materials available to all readers of this book are available at http://www-users.cs.umn.edu/~kumar/dmbook.

- PowerPoint lecture slides
- Suggestions for student projects
- Data mining resources, such as algorithms and data sets
- Online tutorials that give step-by-step examples for selected data mining techniques described in the book using actual data sets and data analysis software

Additional support materials, including solutions to exercises, are available only to instructors adopting this textbook for classroom use. The book's resources will be mirrored at www.pearsonhighered.com/cs-resources. Comments and suggestions, as well as reports of errors, can be sent to the authors through dmbook@cs.umn.edu.

Acknowledgments

Many people contributed to the first and second editions of the book. We begin by acknowledging our families to whom this book is dedicated. Without their patience and support, this project would have been impossible.

We would like to thank the current and former students of our data mining groups at the University of Minnesota and Michigan State for their contributions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial data mining classes. Some of the exercises and presentation slides that they created can be found in the book and its accompanying slides. Students in our data mining groups who provided comments on drafts of the book or who contributed in other ways include Shyam Boriah, Haibin Cheng, Varun Chandola, Eric Eilertson, Levent Ertöz, Jing Gao, Rohit Gupta, Sridhar Iyer, Jung-Eun Lee, Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey, Kashif Riaz, Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and Pusheng Zhang. We would also like to thank the students of our data mining classes at the University of Minnesota and Michigan State University who worked with early drafts of the book and provided invaluable feedback. We specifically note the helpful suggestions of Bernardo Craemer, Arifin Ruslim, Jamshid Vayghan, and Yu Wei.

Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of Florida) class tested early versions of the book. We also received many useful suggestions directly from the following UT students: Pankaj Adhikari, Rajiv Bhatia, Frederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi, Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang.

Ronald Kostoff (ONR) read an early version of the clustering chapter and offered numerous suggestions. George Karypis provided invaluable LaTeX assistance in creating an author index. Irene Moulitsas also provided assistance with LaTeX and reviewed some of the appendices. Musetta Steinbach was very helpful in finding errors in the figures.

We would like to acknowledge our colleagues at the University of Minnesota and Michigan State who have helped create a positive environment for data mining research. They include Arindam Banerjee, Dan Boley, Joyce Chai, Anil Jain, Ravi Janardan, Rong Jin, George Karypis, Claudia Neuhauser, Haesun Park, William F. Punch, György Simon, Shashi Shekhar, and Jaideep Srivastava. The collaborators on our many data mining projects, who also have our gratitude, include Ramesh Agrawal, Maneesh Bhargava, Steve Cannon, Alok Choudhary, Imme Ebert-Uphoff, Auroop Ganguly, Piet C. de Groen, Fran Hill, Yongdae Kim, Steve Klooster, Kerry Long, Nihar Mahapatra, Rama Nemani, Nikunj Oza, Chris Potter, Lisiane Pruinelli, Nagiza Samatova, Jonathan Shapiro, Kevin Silverstein, Brian Van Ness, Bonnie Westra, Nevin Young, and Zhi-Li Zhang.

The departments of Computer Science and Engineering at the University of Minnesota and Michigan State University provided computing resources and a supportive environment for this project. ARDA, ARL, ARO, DOE, NASA, NOAA, and NSF provided research support for Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar. In particular, Kamal Abdali, Mitra Basu, Dick Brackney, Jagdish Chandra, Joe Coughlan, Michael Coyle, Stephen Davis, Frederica Darema, Richard Hirsch, Chandrika Kamath, Tsengdar Lee, Raju Namburu, N. Radhakrishnan, James Sidoran, Sylvia Spengler, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, Aidong Zhang, and Xiaodong Zhang have been supportive of our research in data mining and high-performance computing.

It was a pleasure working with the helpful staff at Pearson Education. In particular, we would like to thank Matt Goldstein, Kathy Smith, Carole Snyder, and Joyce Wells. We would also like to thank George Nichols, who helped with the artwork, and Paul Anagnostopoulos, who provided LaTeX support.

We are grateful to the following Pearson reviewers: Leman Akoglu (Carnegie Mellon University), Chien-Chung Chan (University of Akron), Zhengxin Chen (University of Nebraska at Omaha), Chris Clifton (Purdue University), Joydeep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute of Technology), J. Michael Hardin (University of Alabama), Jingrui He (Arizona State University), James Hearne (Western Washington University), Hillol Kargupta (University of Maryland, Baltimore County and Agnik, LLC), Eamonn Keogh (University of California-Riverside), Bing Liu (University of Illinois at Chicago), Mariofanna Milanova (University of Arkansas at Little Rock), Srinivasan Parthasarathy (Ohio State University), Zbigniew W. Ras (University of North Carolina at Charlotte), Xintao Wu (University of North Carolina at Charlotte), and Mohammed J. Zaki (Rensselaer Polytechnic Institute).

Over the years since the first edition, we have also received numerous comments from readers and students who have pointed out typos and various other issues. We are unable to mention these individuals by name, but their input is much appreciated and has been taken into account for the second edition.

Contents

Preface to the Second Edition v

1 Introduction 1
1.1 What Is Data Mining? 4
1.2 Motivating Challenges 5
1.3 The Origins of Data Mining 7
1.4 Data Mining Tasks 9
1.5 Scope and Organization of the Book 13
1.6 Bibliographic Notes 15
1.7 Exercises 21

2 Data 23
2.1 Types of Data 26
2.1.1 Attributes and Measurement 27
2.1.2 Types of Data Sets 34
2.2 Data Quality 42
2.2.1 Measurement and Data Collection Issues 42
2.2.2 Issues Related to Applications 49
2.3 Data Preprocessing 50
2.3.1 Aggregation 51
2.3.2 Sampling 52
2.3.3 Dimensionality Reduction 56
2.3.4 Feature Subset Selection 58
2.3.5 Feature Creation 61
2.3.6 Discretization and Binarization 63
2.3.7 Variable Transformation 69
2.4 Measures of Similarity and Dissimilarity 71
2.4.1 Basics 72
2.4.2 Similarity and Dissimilarity between Simple Attributes 74
2.4.3 Dissimilarities between Data Objects 76
2.4.4 Similarities between Data Objects 78
2.4.5 Examples of Proximity Measures 79
2.4.6 Mutual Information 88
2.4.7 Kernel Functions* 90
2.4.8 Bregman Divergence* 94
2.4.9 Issues in Proximity Calculation 96
2.4.10 Selecting the Right Proximity Measure 98
2.5 Bibliographic Notes 100
2.6 Exercises 105

3 Classification: Basic Concepts and Techniques 113
3.1 Basic Concepts 114
3.2 General Framework for Classification 117
3.3 Decision Tree Classifier 119
3.3.1 A Basic Algorithm to Build a Decision Tree 121
3.3.2 Methods for Expressing Attribute Test Conditions 124
3.3.3 Measures for Selecting an Attribute Test Condition 127
3.3.4 Algorithm for Decision Tree Induction 136
3.3.5 Example Application: Web Robot Detection 138
3.3.6 Characteristics of Decision Tree Classifiers 140
3.4 Model Overfitting 147
3.4.1 Reasons for Model Overfitting 149
3.5 Model Selection 156
3.5.1 Using a Validation Set 156
3.5.2 Incorporating Model Complexity 157
3.5.3 Estimating Statistical Bounds 162
3.5.4 Model Selection for Decision Trees 162
3.6 Model Evaluation 164
3.6.1 Holdout Method 165
3.6.2 Cross-Validation 165
3.7 Presence of Hyper-parameters 168
3.7.1 Hyper-parameter Selection 168
3.7.2 Nested Cross-Validation 170
3.8 Pitfalls of Model Selection and Evaluation 172
3.8.1 Overlap between Training and Test Sets 172
3.8.2 Use of Validation Error as Generalization Error 172
3.9 Model Comparison 173
3.9.1 Estimating the Confidence Interval for Accuracy 174
3.9.2 Comparing the Performance of Two Models 175
3.10 Bibliographic Notes 176
3.11 Exercises 185

4 Classification: Alternative Techniques 193
4.1 Types of Classifiers 193
4.2 Rule-Based Classifier 195
4.2.1 How a Rule-Based Classifier Works 197
4.2.2 Properties of a Rule Set 198
4.2.3 Direct Methods for Rule Extraction 199
4.2.4 Indirect Methods for Rule Extraction 204
4.2.5 Characteristics of Rule-Based Classifiers 206
4.3 Nearest Neighbor Classifiers 208
4.3.1 Algorithm 209
4.3.2 Characteristics of Nearest Neighbor Classifiers 210


4.4 Naïve Bayes Classifier 212
4.4.1 Basics of Probability Theory 213
4.4.2 Naïve Bayes Assumption 218
4.5 Bayesian Networks 227
4.5.1 Graphical Representation 227
4.5.2 Inference and Learning 233
4.5.3 Characteristics of Bayesian Networks 242
4.6 Logistic Regression 243
4.6.1 Logistic Regression as a Generalized Linear Model 244
4.6.2 Learning Model Parameters 245
4.6.3 Characteristics of Logistic Regression 248
4.7 Artificial Neural Network (ANN) 249
4.7.1 Perceptron 250
4.7.2 Multi-layer Neural Network 254
4.7.3 Characteristics of ANN 261
4.8 Deep Learning 262
4.8.1 Using Synergistic Loss Functions 263
4.8.2 Using Responsive Activation Functions 266
4.8.3 Regularization 268
4.8.4 Initialization of Model Parameters 271
4.8.5 Characteristics of Deep Learning 275
4.9 Support Vector Machine (SVM) 276
4.9.1 Margin of a Separating Hyperplane 276
4.9.2 Linear SVM 278
4.9.3 Soft-margin SVM 284
4.9.4 Nonlinear SVM 290
4.9.5 Characteristics of SVM 294
4.10 Ensemble Methods 296
4.10.1 Rationale for Ensemble Method 297
4.10.2 Methods for Constructing an Ensemble Classifier 297
4.10.3 Bias-Variance Decomposition 300
4.10.4 Bagging 302
4.10.5 Boosting 305
4.10.6 Random Forests 310
4.10.7 Empirical Comparison among Ensemble Methods 312
4.11 Class Imbalance Problem 313
4.11.1 Building Classifiers with Class Imbalance 314
4.11.2 Evaluating Performance with Class Imbalance 318
4.11.3 Finding an Optimal Score Threshold 322
4.11.4 Aggregate Evaluation of Performance 323
4.12 Multiclass Problem 330
4.13 Bibliographic Notes 333
4.14 Exercises 345

5 Association Analysis: Basic Concepts and Algorithms 357
5.1 Preliminaries 358
5.2 Frequent Itemset Generation 362
5.2.1 The Apriori Principle 363
5.2.2 Frequent Itemset Generation in the Apriori Algorithm 364
5.2.3 Candidate Generation and Pruning 368
5.2.4 Support Counting 373
5.2.5 Computational Complexity 377
5.3 Rule Generation 380
5.3.1 Confidence-Based Pruning 380
5.3.2 Rule Generation in Apriori Algorithm 381
5.3.3 An Example: Congressional Voting Records 382
5.4 Compact Representation of Frequent Itemsets 384
5.4.1 Maximal Frequent Itemsets 384
5.4.2 Closed Itemsets 386
5.5 Alternative Methods for Generating Frequent Itemsets* 389
5.6 FP-Growth Algorithm* 393
5.6.1 FP-Tree Representation 394
5.6.2 Frequent Itemset Generation in FP-Growth Algorithm 397
5.7 Evaluation of Association Patterns 401
5.7.1 Objective Measures of Interestingness 402
5.7.2 Measures beyond Pairs of Binary Variables 414
5.7.3 Simpson's Paradox 416
5.8 Effect of Skewed Support Distribution 418
5.9 Bibliographic Notes 424
5.10 Exercises 438

6 Association Analysis: Advanced Concepts 451
6.1 Handling Categorical Attributes 451
6.2 Handling Continuous Attributes 454
6.2.1 Discretization-Based Methods 454
6.2.2 Statistics-Based Methods 458
6.2.3 Non-discretization Methods 460
6.3 Handling a Concept Hierarchy 462
6.4 Sequential Patterns 464
6.4.1 Preliminaries 465
6.4.2 Sequential Pattern Discovery 468
6.4.3 Timing Constraints 473
6.4.4 Alternative Counting Schemes 477
6.5 Subgraph Patterns 479
6.5.1 Preliminaries 480
6.5.2 Frequent Subgraph Mining 483
6.5.3 Candidate Generation 487
6.5.4 Candidate Pruning 493
6.5.5 Support Counting 493
6.6 Infrequent Patterns 493
6.6.1 Negative Patterns 494
6.6.2 Negatively Correlated Patterns 495
6.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns 496
6.6.4 Techniques for Mining Interesting Infrequent Patterns 498
6.6.5 Techniques Based on Mining Negative Patterns 499
6.6.6 Techniques Based on Support Expectation 501
6.7 Bibliographic Notes 505
6.8 Exercises 510

7 Cluster Analysis: Basic Concepts and Algorithms 525
7.1 Overview 528
7.1.1 What Is Cluster Analysis? 528
7.1.2 Different Types of Clusterings 529
7.1.3 Different Types of Clusters 531
7.2 K-means 534
7.2.1 The Basic K-means Algorithm 535
7.2.2 K-means: Additional Issues 544
7.2.3 Bisecting K-means 547
7.2.4 K-means and Different Types of Clusters 548
7.2.5 Strengths and Weaknesses 549
7.2.6 K-means as an Optimization Problem 549
7.3 Agglomerative Hierarchical Clustering 554
7.3.1 Basic Agglomerative Hierarchical Clustering Algorithm 555
7.3.2 Specific Techniques 557
7.3.3 The Lance-Williams Formula for Cluster Proximity 562
7.3.4 Key Issues in Hierarchical Clustering 563
7.3.5 Outliers 564
7.3.6 Strengths and Weaknesses 565
7.4 DBSCAN 565
7.4.1 Traditional Density: Center-Based Approach 565
7.4.2 The DBSCAN Algorithm 567
7.4.3 Strengths and Weaknesses 569
7.5 Cluster Evaluation 571
7.5.1 Overview 571
7.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation 574
7.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix 582
7.5.4 Unsupervised Evaluation of Hierarchical Clustering 585
7.5.5 Determining the Correct Number of Clusters 587
7.5.6 Clustering Tendency 588
7.5.7 Supervised Measures of Cluster Validity 589
7.5.8 Assessing the Significance of Cluster Validity Measures 594
7.5.9 Choosing a Cluster Validity Measure 596
7.6 Bibliographic Notes 597
7.7 Exercises 603

8 Cluster Analysis: Additional Issues and Algorithms 613
8.1 Characteristics of Data, Clusters, and Clustering Algorithms 614
8.1.1 Example: Comparing K-means and DBSCAN 614
8.1.2 Data Characteristics 615
8.1.3 Cluster Characteristics 617
8.1.4 General Characteristics of Clustering Algorithms 619
8.2 Prototype-Based Clustering 621
8.2.1 Fuzzy Clustering 621
8.2.2 Clustering Using Mixture Models 627
8.2.3 Self-Organizing Maps (SOM) 637
8.3 Density-Based Clustering 644
8.3.1 Grid-Based Clustering 644
8.3.2 Subspace Clustering 648
8.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering 652
8.4 Graph-Based Clustering 656
8.4.1 Sparsification 657
8.4.2 Minimum Spanning Tree (MST) Clustering 658
8.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS 659
8.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling 660
8.4.5 Spectral Clustering 666
8.4.6 Shared Nearest Neighbor Similarity 673
8.4.7 The Jarvis-Patrick Clustering Algorithm 676
8.4.8 SNN Density 678
8.4.9 SNN Density-Based Clustering 679
8.5 Scalable Clustering Algorithms 681
8.5.1 Scalability: General Issues and Approaches 681
8.5.2 BIRCH 684
8.5.3 CURE 686
8.6 Which Clustering Algorithm? 690
8.7 Bibliographic Notes 693
8.8 Exercises 699

9 Anomaly Detection 703
9.1 Characteristics of Anomaly Detection Problems 705
9.1.1 A Definition of an Anomaly 705
9.1.2 Nature of Data 706
9.1.3 How Anomaly Detection is Used 707
9.2 Characteristics of Anomaly Detection Methods 708
9.3 Statistical Approaches 710
9.3.1 Using Parametric Models 710
9.3.2 Using Non-parametric Models 714
9.3.3 Modeling Normal and Anomalous Classes 715
9.3.4 Assessing Statistical Significance 717
9.3.5 Strengths and Weaknesses 718
9.4 Proximity-based Approaches 719
9.4.1 Distance-based Anomaly Score 719
9.4.2 Density-based Anomaly Score 720
9.4.3 Relative Density-based Anomaly Score 722
9.4.4 Strengths and Weaknesses 723
9.5 Clustering-based Approaches 724
9.5.1 Finding Anomalous Clusters 724
9.5.2 Finding Anomalous Instances 725
9.5.3 Strengths and Weaknesses 728
9.6 Reconstruction-based Approaches 728
9.6.1 Strengths and Weaknesses 731
9.7 One-class Classification 732
9.7.1 Use of Kernels 733
9.7.2 The Origin Trick 734
9.7.3 Strengths and Weaknesses 738
9.8 Information Theoretic Approaches 738
9.8.1 Strengths and Weaknesses 740
9.9 Evaluation of Anomaly Detection 740
9.10 Bibliographic Notes 742
9.11 Exercises 749

10 Avoiding False Discoveries 755
10.1 Preliminaries: Statistical Testing 756
10.1.1 Significance Testing 756
10.1.2 Hypothesis Testing 761
10.1.3 Multiple Hypothesis Testing 767
10.1.4 Pitfalls in Statistical Testing 776
10.2 Modeling Null and Alternative Distributions 778
10.2.1 Generating Synthetic Data Sets 781
10.2.2 Randomizing Class Labels 782
10.2.3 Resampling Instances 782
10.2.4 Modeling the Distribution of the Test Statistic 783
10.3 Statistical Testing for Classification 783
10.3.1 Evaluating Classification Performance 783
10.3.2 Binary Classification as Multiple Hypothesis Testing 785
10.3.3 Multiple Hypothesis Testing in Model Selection 786
10.4 Statistical Testing for Association Analysis 787
10.4.1 Using Statistical Models 788
10.4.2 Using Randomization Methods 794
10.5 Statistical Testing for Cluster Analysis 795
10.5.1 Generating a Null Distribution for Internal Indices 796
10.5.2 Generating a Null Distribution for External Indices 798
10.5.3 Enrichment 798
10.6 Statistical Testing for Anomaly Detection 800
10.7 Bibliographic Notes 803
10.8 Exercises 808

Author Index 816
Subject Index 829
Copyright Permissions 839

1 Introduction

Rapid advances in data collection and storage technology, coupled with the ease with which data can be generated and disseminated, have triggered the explosive growth of data, leading to the current age of big data. Deriving actionable insights from these large data sets is increasingly important in decision making across almost all areas of society, including business and industry; science and engineering; medicine and biotechnology; and government and individuals. However, the amount of data (volume), its complexity (variety), and the rate at which it is being collected and processed (velocity) have simply become too great for humans to analyze unaided. Thus, there is a great need for automated tools for extracting useful information from the big data despite the challenges posed by its enormity and diversity.

Data mining blends traditional data analysis methods with sophisticated algorithms for processing this abundance of data. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some applications that require more advanced techniques for data analysis.

Business and Industry

Point-of-sale data collection (bar code scanners, radio frequency identification (RFID), and smart card technology) have allowed retailers to collect up-to-the-minute data about customer purchases at the checkout counters of their stores. Retailers can utilize this information, along with other business-critical data, such as web server logs from e-commerce websites and customer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions.

Data mining techniques can be used to support a wide range of business intelligence applications, such as customer profiling, targeted marketing, workflow management, store layout, fraud detection, and automated buying and selling. An example of the last application is high-speed stock trading, where decisions on buying and selling have to be made in less than a second using data about financial transactions. Data mining can also help retailers answer important business questions, such as "Who are the most profitable customers?" "What products can be cross-sold or up-sold?" and "What is the revenue outlook of the company for next year?" These questions have inspired the development of such data mining techniques as association analysis (Chapters 5 and 6).

As the Internet continues to revolutionize the way we interact and make decisions in our everyday lives, we are generating massive amounts of data about our online experiences, e.g., web browsing, messaging, and posting on social networking websites. This has opened several opportunities for business applications that use web data. For example, in the e-commerce sector, data about our online viewing or shopping preferences can be used to provide personalized recommendations of products. Data mining also plays a prominent role in supporting several other Internet-based services, such as filtering spam messages, answering search queries, and suggesting social updates and connections. The large corpus of text, images, and videos available on the Internet has enabled a number of advancements in data mining methods, including deep learning, which is discussed in Chapter 4. These developments have led to great advances in a number of applications, such as object recognition, natural language translation, and autonomous driving.

Another domain that has undergone a rapid big data transformation is the use of mobile sensors and devices, such as smart phones and wearable computing devices. With better sensor technologies, it has become possible to collect a variety of information about our physical world using low-cost sensors embedded on everyday objects that are connected to each other, termed the Internet of Things (IOT). This deep integration of physical sensors in digital systems is beginning to generate large amounts of diverse and distributed data about our environment, which can be used for designing convenient, safe, and energy-efficient home systems, as well as for urban planning of smart cities.

Medicine, Science, and Engineering

Researchers in medicine, science, and engineering are rapidly accumulating data that is key to significant new discoveries. For example, as an important step toward improving our understanding of the Earth's climate system, NASA has deployed a series of Earth-orbiting satellites that continuously generate global observations of the land surface, oceans, and atmosphere. However, because of the size and spatio-temporal nature of the data, traditional methods are often not suitable for analyzing these data sets. Techniques developed in data mining can aid Earth scientists in answering questions such as the following: "What is the relationship between the frequency and intensity of ecosystem disturbances such as droughts and hurricanes to global warming?" "How is land surface precipitation and temperature affected by ocean surface temperature?" and "How well can we predict the beginning and end of the growing season for a region?"

As another example, researchers in molecular biology hope to use the large amounts of genomic data to better understand the structure and function of genes. In the past, traditional methods in molecular biology allowed scientists to study only a few genes at a time in a given experiment. Recent breakthroughs in microarray technology have enabled scientists to compare the behavior of thousands of genes under various situations. Such comparisons can help determine the function of each gene, and perhaps isolate the genes responsible for certain diseases. However, the noisy, high-dimensional nature of data requires new data analysis methods. In addition to analyzing gene expression data, data mining can also be used to address other important biological challenges such as protein structure prediction, multiple sequence alignment, the modeling of biochemical pathways, and phylogenetics.

Another example is the use of data mining techniques to analyze electronic health record (EHR) data, which has become increasingly available. Not very long ago, studies of patients required manually examining the physical records of individual patients and extracting very specific pieces of information pertinent to the particular question being investigated. EHRs allow for a faster and broader exploration of such data. However, there are significant challenges since the observations on any one patient typically occur during their visits to a doctor or hospital and only a small number of details about the health of the patient are measured during any particular visit.

Currently, EHR analysis focuses on simple types of data, e.g., a patient's blood pressure or the diagnosis code of a disease. However, large amounts of more complex types of medical data are also being collected, such as electrocardiograms (ECGs) and neuroimages from magnetic resonance imaging (MRI) or functional Magnetic Resonance Imaging (fMRI). Although challenging to analyze, this data also provides vital information about patients. Integrating and analyzing such data, with traditional EHR and genomic data, is one of the capabilities needed to enable precision medicine, which aims to provide more personalized patient care.

1.1 What Is Data Mining?

Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large data sets in order to find novel and useful patterns that might otherwise remain unknown. They also provide the capability to predict the outcome of a future observation, such as the amount a customer will spend at an online or a brick-and-mortar store.

Not all information discovery tasks are considered to be data mining. Examples include queries, e.g., looking up individual records in a database or finding web pages that contain a particular set of keywords. This is because such tasks can be accomplished through simple interactions with a database management system or an information retrieval system. These systems rely on traditional computer science techniques, which include sophisticated indexing structures and query processing algorithms, for efficiently organizing and retrieving information from large data repositories. Nonetheless, data mining techniques have been used to enhance the performance of such systems by improving the quality of the search results based on their relevance to the input queries.

Data Mining and Knowledge Discovery in Databases

Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information, as shown in Figure 1.1. This process consists of a series of steps, from data preprocessing to postprocessing of data mining results.

Figure 1.1. The process of knowledge discovery in databases (KDD).

The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables) and may reside in a centralized data repository or be distributed across multiple sites. The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. The steps involved in data preprocessing include fusing data from multiple sources, cleaning data to remove noise and duplicate observations, and selecting records and features that are relevant to the data mining task at hand. Because of the many ways data can be collected and stored, data preprocessing is perhaps the most laborious and time-consuming step in the overall knowledge discovery process.
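To make the preprocessing step concrete, the following is a minimal sketch in Python (the customer tables, column names, and threshold are hypothetical, not from the text) of the operations described above: fusing data from multiple sources, cleaning out duplicate observations, and selecting the records and features relevant to the task.

import pandas as pd

# Hypothetical raw inputs: transactions recorded separately by two store branches.
branch_a = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [20.0, 35.5, 35.5]})
branch_b = pd.DataFrame({"customer_id": [3, 1], "amount": [12.0, 48.0]})

# Fuse data from multiple sources into a single table.
raw = pd.concat([branch_a, branch_b], ignore_index=True)

# Clean: drop exact duplicate observations (e.g., a double-logged transaction).
clean = raw.drop_duplicates()

# Select only the records and features relevant to the mining task at hand.
relevant = clean.loc[clean["amount"] > 15.0, ["customer_id", "amount"]]
print(relevant)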

"Closing the loop" is a phrase often used to refer to the process of integrating data mining results into decision support systems. For example, in business applications, the insights offered by data mining results can be integrated with campaign management tools so that effective marketing promotions can be conducted and tested. Such integration requires a postprocessing step to ensure that only valid and useful results are incorporated into the decision support system. An example of postprocessing is visualization, which allows analysts to explore the data and the data mining results from a variety of viewpoints. Hypothesis testing methods can also be applied during postprocessing to eliminate spurious data mining results. (See Chapter 10.)

1.2 Motivating Challenges

As mentioned earlier, traditional data analysis techniques have often encountered practical difficulties in meeting the challenges posed by big data applications. The following are some of the specific challenges that motivated the development of data mining.

Scalability

Because of advances in data generation and collection, data sets with sizes of terabytes, petabytes, or even exabytes are becoming common. If data mining algorithms are to handle these massive data sets, they must be scalable. Many data mining algorithms employ special search strategies to handle exponential search problems. Scalability may also require the implementation of novel data structures to access individual records in an efficient manner. For instance, out-of-core algorithms may be necessary when processing data sets that cannot fit into main memory. Scalability can also be improved by using sampling or developing parallel and distributed algorithms. A general overview of techniques for scaling up data mining algorithms is given in Appendix F.
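As an illustration of the out-of-core and sampling ideas mentioned above, the sketch below (assuming a hypothetical file transactions.csv that is too large to load at once) processes the data in chunks and retains only a small random sample in memory.

import pandas as pd

# Read the (hypothetical) large file in chunks instead of loading it whole.
chunks = pd.read_csv("transactions.csv", chunksize=100_000)

sample_parts = []
for chunk in chunks:
    # Each chunk fits in memory; keep a 1% random sample of its rows.
    sample_parts.append(chunk.sample(frac=0.01, random_state=0))

sample = pd.concat(sample_parts, ignore_index=True)
print(len(sample), "rows retained for in-memory analysis")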

High Dimensionality

It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago. In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features. Data sets with temporal or spatial components also tend to have high dimensionality. For example, consider a data set that contains measurements of temperature at various locations. If the temperature measurements are taken repeatedly for an extended period, the number of dimensions (features) increases in proportion to the number of measurements taken. Traditional data analysis techniques that were developed for low-dimensional data often do not work well for such high-dimensional data due to issues such as the curse of dimensionality (to be discussed in Chapter 2). Also, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality (the number of features) increases.
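A small numerical experiment, using randomly generated data rather than any data set from the text, illustrates one facet of the curse of dimensionality: as the number of features grows, the nearest and farthest neighbors of a point become nearly equidistant, which undermines distance-based analyses.

import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:                       # number of dimensions (features)
    X = rng.random((500, d))                       # 500 random points in the unit hypercube
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from the first point
    print(f"d={d:4d}  nearest/farthest distance ratio = {dists.min() / dists.max():.3f}")

The printed ratio approaches 1 as d increases, showing that the contrast between near and far points shrinks in high dimensions.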

Heterogeneous and Complex Data

Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical. As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes. Recent years have also seen the emergence of more complex data objects. Examples of such non-traditional types of data include web and social media data containing text, hyperlinks, images, audio, and videos; DNA data with sequential and three-dimensional structure; and climate data that consists of measurements (temperature, pressure, etc.) at various times and locations on the Earth's surface. Techniques developed for mining such complex objects should take into consideration relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between the elements in semi-structured text and XML documents.

Data Ownership and Distribution

Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. The key challenges faced by distributed data mining algorithms include the following: (1) how to reduce the amount of communication needed to perform the distributed computation, (2) how to effectively consolidate the data mining results obtained from multiple sources, and (3) how to address data security and privacy issues.

Non-traditional Analysis

The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor-intensive. Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation. Furthermore, the data sets analyzed in data mining are typically not the result of a carefully designed experiment and often represent opportunistic samples of the data, rather than random samples.

1.3 The Origins of Data Mining

While data mining has traditionally been viewed as an intermediate process within the KDD framework, as shown in Figure 1.1, it has emerged over the years as an academic field within computer science, focusing on all aspects of KDD, including data preprocessing, mining, and postprocessing. Its origin can be traced back to the late 1980s, following a series of workshops organized on the topic of knowledge discovery in databases. The workshops brought together researchers from different disciplines to discuss the challenges and opportunities in applying computational techniques to extract actionable knowledge from large databases. The workshops quickly grew into hugely popular conferences that were attended by researchers and practitioners from both the academia and industry. The success of these conferences, along with the interest shown by businesses and industry in recruiting new hires with a data mining background, have fueled the tremendous growth of this field.

The field was initially built upon the methodology and algorithms that researchers had previously used. In particular, data mining researchers draw upon ideas, such as (1) sampling, estimation, and hypothesis testing from statistics and (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. Data mining has also been quick to adopt ideas from other areas, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval, and extending them to solve the challenges of mining big data.

A number of other areas also play key supporting roles. In particular, database systems are needed to provide support for efficient storage, indexing, and query processing. Techniques from high performance (parallel) computing are often important in addressing the massive size of some data sets. Distributed techniques can also help address the issue of size and are essential when the data cannot be gathered in one location. Figure 1.2 shows the relationship of data mining to other areas.

Figure 1.2. Data mining as a confluence of many disciplines.

Data Science and Data-Driven Discovery

Data science is an interdisciplinary field that studies and applies tools and techniques for deriving useful insights from data. Although data science is regarded as an emerging field with a distinct identity of its own, the tools and techniques often come from many different areas of data analysis, such as data mining, statistics, AI, machine learning, pattern recognition, database technology, and distributed and parallel computing. (See Figure 1.2.)

The emergence of data science as a new field is a recognition that, often, none of the existing areas of data analysis provides a complete set of tools for the data analysis tasks that are often encountered in emerging applications. Instead, a broad range of computational, mathematical, and statistical skills is often required. To illustrate the challenges that arise in analyzing such data, consider the following example. Social media and the Web present new opportunities for social scientists to observe and quantitatively measure human behavior on a large scale. To conduct such a study, social scientists work with analysts who possess skills in areas such as web mining, natural language processing (NLP), network analysis, data mining, and statistics. Compared to more traditional research in social science, which is often based on surveys, this analysis requires a broader range of skills and tools, and involves far larger amounts of data. Thus, data science is, by necessity, a highly interdisciplinary field that builds on the continuing work of many fields.

The data-driven approach of data science emphasizes the direct discovery of patterns and relationships from data, especially in large quantities of data, often without the need for extensive domain knowledge. A notable example of the success of this approach is represented by advances in neural networks, i.e., deep learning, which have been particularly successful in areas which have long proved challenging, e.g., recognizing objects in photos or videos and words in speech, as well as in other application areas. However, note that this is just one example of the success of data-driven approaches, and dramatic improvements have also occurred in many other areas of data analysis. Many of these developments are topics described later in this book.

Some cautions on potential limitations of a purely data-driven approach are given in the Bibliographic Notes.

1.4 Data Mining Tasks

Data mining tasks are generally divided into two major categories:

Predictive tasks The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.

Descriptive tasks Here, the objective is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks are often exploratory in nature and frequently require postprocessing techniques to validate and explain the results.

Figure 1.3 illustrates four of the core data mining tasks that are described in the remainder of this book.

Figure 1.3. Four of the core data mining tasks.

Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks: classification, which is used for discrete target variables, and regression, which is used for continuous target variables. For example, predicting whether a web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued. On the other hand, forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable. Predictive modeling can be used to identify customers who will respond to a marketing campaign, predict disturbances in the Earth's ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.

Example 1.1 (Predicting the Type of a Flower). Consider the task of predicting the species of a flower based on the characteristics of the flower. In particular, consider classifying an Iris flower as one of the following three Iris species: Setosa, Versicolour, or Virginica. To perform this task, we need a data set containing the characteristics of various flowers of these three species. A data set with this type of information is the well-known Iris data set from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a flower, this data set contains four other attributes: sepal width, sepal length, petal length, and petal width. Figure 1.4 shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively. Also, petal length is broken into the categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively. Based on these categories of petal width and length, the following rules can be derived:

Petal width low and petal length low implies Setosa.

Petal width medium and petal length medium implies Versicolour.

Petal width high and petal length high implies Virginica.

While these rules do not classify all the flowers, they do a good (but not perfect) job of classifying most of the flowers. Note that flowers from the Setosa species are well separated from the Versicolour and Virginica species with respect to petal width and length, but the latter two species overlap somewhat with respect to these attributes.

Figure 1.4. Petal width versus petal length for 150 Iris flowers.
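The rules of Example 1.1 translate directly into code. The sketch below (the thresholds are the interval boundaries given above; the sample measurements are made up for illustration and are not taken from the Iris data set) bins petal width and petal length and applies the three rules.

def petal_width_bin(width):
    return "low" if width < 0.75 else ("medium" if width < 1.75 else "high")

def petal_length_bin(length):
    return "low" if length < 2.5 else ("medium" if length < 5.0 else "high")

def classify_iris(petal_length, petal_width):
    w, l = petal_width_bin(petal_width), petal_length_bin(petal_length)
    if w == "low" and l == "low":
        return "Setosa"
    if w == "medium" and l == "medium":
        return "Versicolour"
    if w == "high" and l == "high":
        return "Virginica"
    return "unclassified"   # the rules do not cover every flower

# Illustrative (petal length, petal width) measurements in centimeters.
for length, width in [(1.4, 0.2), (4.5, 1.5), (6.0, 2.1)]:
    print((length, width), "->", classify_iris(length, width))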

Association analysis is used to discover patterns that describe strongly associated features in the data. The discovered patterns are typically represented in the form of implication rules or feature subsets. Because of the exponential size of its search space, the goal of association analysis is to extract the most interesting patterns in an efficient manner. Useful applications of association analysis include finding groups of genes that have related functionality, identifying web pages that are accessed together, or understanding the relationships between different elements of Earth's climate system.

Example 1.2 (Market Basket Analysis). The transactions shown in Table 1.1 illustrate point-of-sale data collected at the checkout counters of a grocery store. Association analysis can be applied to find items that are frequently bought together by customers. For example, we may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk. This type of rule can be used to identify potential cross-selling opportunities among related items.

Table 1.1. Market basket data.

Transaction ID    Items
1     {Bread, Butter, Diapers, Milk}
2     {Coffee, Sugar, Cookies, Salmon}
3     {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4     {Bread, Butter, Salmon, Chicken}
5     {Eggs, Bread, Butter}
6     {Salmon, Diapers, Milk}
7     {Bread, Tea, Sugar, Eggs}
8     {Coffee, Sugar, Chicken, Eggs}
9     {Bread, Diapers, Milk, Salt}
10    {Tea, Eggs, Cookies, Diapers, Milk}
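To make the rule {Diapers} → {Milk} concrete, the short sketch below computes its support and confidence (in the standard sense: the fraction of all transactions containing both items, and the fraction of transactions containing Diapers that also contain Milk) directly from the ten transactions of Table 1.1.

transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

antecedent, consequent = {"Diapers"}, {"Milk"}
n_antecedent = sum(antecedent <= t for t in transactions)            # transactions with Diapers
n_both = sum((antecedent | consequent) <= t for t in transactions)   # transactions with both items

support = n_both / len(transactions)
confidence = n_both / n_antecedent
print(f"support = {support:.2f}, confidence = {confidence:.2f}")     # 0.50 and 1.00 for this table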

Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters. Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.

Example 1.3 (Document Clustering). The collection of news articles shown in Table 1.2 can be grouped based on their respective topics. Each article is represented as a set of word-frequency pairs (w : c), where w is a word and c is the number of times the word appears in the article. There are two natural clusters in the data set. The first cluster consists of the first four articles, which correspond to news about the economy, while the second cluster contains the last four articles, which correspond to news about health care. A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.

Table 1.2. Collection of news articles.

Article    Word-frequency pairs
1    dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2    machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3    job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4    domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5    patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6    pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7    death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8    medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
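One simple way to see the two groups in Table 1.2 is to compare articles by the overlap of their vocabularies. The sketch below is a simplification (it uses Jaccard similarity over the word sets and ignores the frequency counts); for these eight articles it pairs each economy article with another economy article and each health-care article with another health-care article.

articles = {
    1: {"dollar", "industry", "country", "loan", "deal", "government"},
    2: {"machinery", "labor", "market", "industry", "work", "country"},
    3: {"job", "inflation", "rise", "jobless", "market", "country", "index"},
    4: {"domestic", "forecast", "gain", "market", "sale", "price"},
    5: {"patient", "symptom", "drug", "health", "clinic", "doctor"},
    6: {"pharmaceutical", "company", "drug", "vaccine", "flu"},
    7: {"death", "cancer", "drug", "public", "health", "director"},
    8: {"medical", "cost", "increase", "patient", "health", "care"},
}

def jaccard(a, b):
    # Fraction of distinct words shared by the two articles.
    return len(a & b) / len(a | b)

for i, words in articles.items():
    best = max((j for j in articles if j != i), key=lambda j: jaccard(words, articles[j]))
    print(f"article {i} is most similar to article {best}")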

Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud, network intrusions, unusual patterns of disease, and ecosystem disturbances, such as droughts, floods, fires, hurricanes, etc.

Example 1.4 (Credit Card Fraud Detection). A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
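A minimal sketch of the profile-and-compare idea in Example 1.4 follows; the past transaction amounts and the threshold are hypothetical. The user's profile is summarized by the mean and standard deviation of past amounts, and a new transaction is flagged when it deviates from that profile by more than three standard deviations.

import statistics

# Hypothetical profile: a user's past legitimate transaction amounts (in dollars).
past_amounts = [23.5, 41.0, 18.2, 52.7, 30.0, 27.4, 44.9, 35.1]
mean = statistics.mean(past_amounts)
stdev = statistics.stdev(past_amounts)

def is_suspicious(amount, threshold=3.0):
    # Flag a transaction whose amount is far from the user's usual behavior.
    return abs(amount - mean) / stdev > threshold

for amount in [38.00, 940.00]:              # an ordinary and an unusual transaction
    label = "flagged" if is_suspicious(amount) else "ok"
    print(f"${amount:8.2f} -> {label}")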

1.5 Scope and Organization of the Book

This book introduces the major principles and techniques used in data mining from an algorithmic perspective. A study of these principles and techniques is essential for developing a better understanding of how data mining technology can be applied to various kinds of data. This book also serves as a starting point for readers who are interested in doing research in this field.

We begin the technical discussion of this book with a chapter on data (Chapter 2), which discusses the basic types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity. Although this material can be covered quickly, it provides an essential foundation for data analysis. Chapters 3 and 4 cover classification. Chapter 3 provides a foundation by discussing decision tree classifiers and several issues that are important to all classification: overfitting, underfitting, model selection, and performance evaluation. Using this foundation, Chapter 4 describes a number of other important classification techniques: rule-based systems, nearest neighbor classifiers, Bayesian classifiers, artificial neural networks, including deep learning, support vector machines, and ensemble classifiers, which are collections of classifiers. The multiclass and imbalanced class problems are also discussed. These topics can be covered independently.

Association analysis is explored in Chapters 5 and 6. Chapter 5 describes the basics of association analysis: frequent itemsets, association rules, and some of the algorithms used to generate them. Specific types of frequent itemsets—maximal, closed, and hyperclique—that are important for data mining are also discussed, and the chapter concludes with a discussion of evaluation measures for association analysis. Chapter 6 considers a variety of more advanced topics, including how association analysis can be applied to categorical and continuous data or to data that has a concept hierarchy. (A concept hierarchy is a hierarchical categorization of objects, e.g., store items → clothing → shoes → sneakers.) This chapter also describes how association analysis can be extended to find sequential patterns (patterns involving order), patterns in graphs, and negative relationships (if one item is present, then the other is not).

Cluster analysis is discussed in Chapters 7 and 8. Chapter 7 first describes the different types of clusters, and then presents three specific clustering techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. This is followed by a discussion of techniques for validating the results of a clustering algorithm. Additional clustering concepts and techniques are explored in Chapter 8, including fuzzy and probabilistic clustering, Self-Organizing Maps (SOM), graph-based clustering, spectral clustering, and density-based clustering. There is also a discussion of scalability issues and factors to consider when selecting a clustering algorithm.

Chapter 9 is on anomaly detection. After some basic definitions, several different types of anomaly detection are considered: statistical, distance-based, density-based, clustering-based, reconstruction-based, one-class classification, and information theoretic. The last chapter, Chapter 10, supplements the discussions in the other chapters with a discussion of the statistical concepts important for avoiding spurious results, and then discusses those concepts in the context of data mining techniques studied in the previous chapters. These techniques include statistical hypothesis testing, p-values, the false discovery rate, and permutation testing. Appendices A through F give a brief review of important topics that are used in portions of the book: linear algebra, dimensionality reduction, statistics, regression, optimization, and scaling up data mining techniques for big data.

The subject of data mining, while relatively young compared to statistics or machine learning, is already too large to cover in a single book. Selected references to topics that are only briefly covered, such as data quality, are provided in the Bibliographic Notes section of the appropriate chapter. References to topics not covered in this book, such as mining streaming data and privacy-preserving data mining, are provided in the Bibliographic Notes of this chapter.

1.6 Bibliographic Notes

The topic of data mining has inspired many textbooks. Introductory textbooks include those by Dunham [16], Han et al. [29], Hand et al. [31], Roiger and Geatz [50], Zaki and Meira [61], and Aggarwal [2]. Data mining books with a stronger emphasis on business applications include the works by Berry and Linoff [5], Pyle [47], and Parr Rud [45]. Books with an emphasis on statistical learning include those by Cherkassky and Mulier [11], and Hastie et al. [32]. Similar books with an emphasis on machine learning or pattern recognition are those by Duda et al. [15], Kantardzic [34], Mitchell [43], Webb [57], and Witten and Frank [58]. There are also some more specialized books: Chakrabarti [9] (web mining), Fayyad et al. [20] (collection of early articles on data mining), Fayyad et al. [18] (visualization), Grossman et al. [25] (science and engineering), Kargupta and Chan [35] (distributed data mining), Wang et al. [56] (bioinformatics), and Zaki and Ho [60] (parallel data mining).

There are several conferences related to data mining. Some of the main conferences dedicated to this field include the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Data mining papers can also be found in other major conferences such as the Conference and Workshop on Neural Information Processing Systems (NIPS), the International Conference on Machine Learning (ICML), the ACM SIGMOD/PODS conference, the International Conference on Very Large Data Bases (VLDB), the Conference on Information and Knowledge Management (CIKM), the International Conference on Data Engineering (ICDE), the National Conference on Artificial Intelligence (AAAI), the IEEE International Conference on Big Data, and the IEEE International Conference on Data Science and Advanced Analytics (DSAA).

Journal publications on data mining include IEEE Transactions on Knowledge and Data Engineering, Data Mining and Knowledge Discovery, Knowledge and Information Systems, ACM Transactions on Knowledge Discovery from Data, Statistical Analysis and Data Mining, and Information Systems. Various open-source data mining software packages are available, including Weka [27] and Scikit-learn [46]. More recently, data mining software such as Apache Mahout and Apache Spark has been developed for large-scale problems on distributed computing platforms.

There have been a number of general articles on data mining that define the field or its relationship to other fields, particularly statistics. Fayyad et al. [19] describe data mining and how it fits into the total knowledge discovery process. Chen et al. [10] give a database perspective on data mining. Ramakrishnan and Grama [48] provide a general discussion of data mining and present several viewpoints. Hand [30] describes how data mining differs from statistics, as does Friedman [21]. Lambert [40] explores the use of statistics for large data sets and provides some comments on the respective roles of data mining and statistics. Glymour et al. [23] consider the lessons that statistics may have for data mining. Smyth et al. [53] describe how the evolution of data mining is being driven by new types of data and applications, such as those involving streams, graphs, and text. Han et al. [28] consider emerging applications in data mining and Smyth [52] describes some research challenges in data mining. Wu et al. [59] discuss how developments in data mining research can be turned into practical tools. Data mining standards are the subject of a paper by Grossman et al. [24]. Bradley [7] discusses how data mining algorithms can be scaled to large data sets.

The emergence of new data mining applications has produced new challenges that need to be addressed. For instance, concerns about privacy breaches as a result of data mining have escalated in recent years, particularly in application domains such as web commerce and health care. As a result, there is growing interest in developing data mining algorithms that maintain user privacy. Developing techniques for mining encrypted or randomized data is known as privacy-preserving data mining. Some general references in this area include papers by Agrawal and Srikant [3], Clifton et al. [12] and Kargupta et al. [36]. Vassilios et al. [55] provide a survey. Another area of concern is the bias in predictive models that may be used for some applications, e.g., screening job applicants or deciding prison parole [39]. Assessing whether such applications are producing biased results is made more difficult by the fact that the predictive models used for such applications are often black box models, i.e., models that are not interpretable in any straightforward way.

Data science, its constituent fields, and more generally, the new paradigm of knowledge discovery they represent [33], have great potential, some of which has been realized. However, it is important to emphasize that data science works mostly with observational data, i.e., data that was collected by various organizations as part of their normal operation. The consequence of this is that sampling biases are common and the determination of causal factors becomes more problematic. For this and a number of other reasons, it is often hard to interpret the predictive models built from this data [42, 49]. Thus, theory, experimentation and computational simulations will continue to be the methods of choice in many areas, especially those related to science.

More importantly, a purely data-driven approach often ignores the existing knowledge in a particular field. Such models may perform poorly, for example, predicting impossible outcomes or failing to generalize to new situations. However, if the model does work well, e.g., has high predictive accuracy, then this approach may be sufficient for practical purposes in some fields. But in many areas, such as medicine and science, gaining insight into the underlying domain is often the goal. Some recent work attempts to address these issues in order to create theory-guided data science, which takes pre-existing domain knowledge into account [17, 37].

Recent years have witnessed a growing number of applications that rapidly generate continuous streams of data. Examples of stream data include network traffic, multimedia streams, and stock prices. Several issues must be considered when mining data streams, such as the limited amount of memory available, the need for online analysis, and the change of the data over time. Data mining for stream data has become an important area in data mining. Some selected publications are Domingos and Hulten [14] (classification), Giannella et al. [22] (association analysis), Guha et al. [26] (clustering), Kifer et al. [38] (change detection), Papadimitriou et al. [44] (time series), and Law et al. [41] (dimensionality reduction).

Another area of interest is recommender and collaborative filtering systems [1, 6, 8, 13, 54], which suggest movies, television shows, books, products, etc. that a person might like. In many cases, this problem, or at least a component of it, is treated as a prediction problem and thus, data mining techniques can be applied [4, 51].

Bibliography

[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.

[2] C. Aggarwal. Data Mining: The Textbook. Springer, 2015.

[3]R.AgrawalandR.Srikant.Privacy-preservingdatamining.InProc.of2000ACMSIGMODIntl.Conf.onManagementofData,pages439–450,Dallas,Texas,2000.ACMPress.

[4]X.AmatriainandJ.M.Pujol.Dataminingmethodsforrecommendersystems.InRecommenderSystemsHandbook,pages227–262.Springer,2015.

[5]M.J.A.BerryandG.Linoff.DataMiningTechniques:ForMarketing,Sales,andCustomerRelationshipManagement.WileyComputerPublishing,2ndedition,2004.

[6]J.Bobadilla,F.Ortega,A.Hernando,andA.Gutiérrez.Recommendersystemssurvey.Knowledge-basedsystems,46:109–132,2013.

[7]P.S.Bradley,J.Gehrke,R.Ramakrishnan,andR.Srikant.Scalingminingalgorithmstolargedatabases.CommunicationsoftheACM,45(8):38–43,2002.

[8]R.Burke.Hybridrecommendersystems:Surveyandexperiments.Usermodelinganduser-adaptedinteraction,12(4):331–370,2002.

[9]S.Chakrabarti.MiningtheWeb:DiscoveringKnowledgefromHypertextData.MorganKaufmann,SanFrancisco,CA,2003.

[10]M.-S.Chen,J.Han,andP.S.Yu.DataMining:AnOverviewfromaDatabasePerspective.IEEETransactionsonKnowledgeandDataEngineering,8(6):866–883,1996.

[11]V.CherkasskyandF.Mulier.LearningfromData:Concepts,Theory,andMethods.Wiley-IEEEPress,2ndedition,1998.

[12]C.Clifton,M.Kantarcioglu,andJ.Vaidya.Definingprivacyfordatamining.InNationalScienceFoundationWorkshoponNextGenerationDataMining,pages126–133,Baltimore,MD,November2002.

[13]C.DesrosiersandG.Karypis.Acomprehensivesurveyofneighborhood-basedrecommendationmethods.Recommendersystemshandbook,pages107–144,2011.

[14] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. of the 6th Intl. Conf. on Knowledge Discovery and Data Mining, pages 71–80, Boston, Massachusetts, 2000. ACM Press.

[15] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., New York, 2nd edition, 2001.

[16]M.H.Dunham.DataMining:IntroductoryandAdvancedTopics.PrenticeHall,2006.

[17]J.H.Faghmous,A.Banerjee,S.Shekhar,M.Steinbach,V.Kumar,A.R.Ganguly,andN.Samatova.Theory-guideddatascienceforclimatechange.Computer,47(11):74–78,2014.

[18]U.M.Fayyad,G.G.Grinstein,andA.Wierse,editors.InformationVisualizationinDataMiningandKnowledgeDiscovery.MorganKaufmannPublishers,SanFrancisco,CA,September2001.

[19]U.M.Fayyad,G.Piatetsky-Shapiro,andP.Smyth.FromDataMiningtoKnowledgeDiscovery:AnOverview.InAdvancesinKnowledgeDiscoveryandDataMining,pages1–34.AAAIPress,1996.

[20]U.M.Fayyad,G.Piatetsky-Shapiro,P.Smyth,andR.Uthurusamy,editors.AdvancesinKnowledgeDiscoveryandDataMining.AAAI/MITPress,1996.

[21]J.H.Friedman.DataMiningandStatistics:What’stheConnection?Unpublished.www-stat.stanford.edu/~jhf/ftp/dm-stat.ps,1997.

[22]C.Giannella,J.Han,J.Pei,X.Yan,andP.S.Yu.MiningFrequentPatternsinDataStreamsatMultipleTimeGranularities.InH.Kargupta,A.Joshi,K.Sivakumar,andY.Yesha,editors,NextGenerationDataMining,pages191–212.AAAI/MIT,2003.

[23]C.Glymour,D.Madigan,D.Pregibon,andP.Smyth.StatisticalThemesandLessonsforDataMining.DataMiningandKnowledgeDiscovery,1(1):11–28,1997.

[24]R.L.Grossman,M.F.Hornick,andG.Meyer.Dataminingstandardsinitiatives.CommunicationsoftheACM,45(8):59–61,2002.

[25]R.L.Grossman,C.Kamath,P.Kegelmeyer,V.Kumar,andR.Namburu,editors.DataMiningforScientificandEngineeringApplications.KluwerAcademicPublishers,2001.

[26]S.Guha,A.Meyerson,N.Mishra,R.Motwani,andL.O’Callaghan.ClusteringDataStreams:TheoryandPractice.IEEETransactionsonKnowledgeandDataEngineering,15(3):515–528,May/June2003.

[27]M.Hall,E.Frank,G.Holmes,B.Pfahringer,P.Reutemann,andI.H.Witten.TheWEKADataMiningSoftware:AnUpdate.SIGKDDExplorations,11(1),2009.

[28]J.Han,R.B.Altman,V.Kumar,H.Mannila,andD.Pregibon.Emergingscientificapplicationsindatamining.CommunicationsoftheACM,45(8):54–58,2002.

[29]J.Han,M.Kamber,andJ.Pei.DataMining:ConceptsandTechniques.MorganKaufmannPublishers,SanFrancisco,3rdedition,2011.

[30]D.J.Hand.DataMining:StatisticsandMore?TheAmericanStatistician,52(2):112–118,1998.

[31]D.J.Hand,H.Mannila,andP.Smyth.PrinciplesofDataMining.MITPress,2001.

[32]T.Hastie,R.Tibshirani,andJ.H.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,Prediction.Springer,2ndedition,2009.

[33]T.Hey,S.Tansley,K.M.Tolle,etal.Thefourthparadigm:data-intensivescientificdiscovery,volume1.MicrosoftresearchRedmond,WA,2009.

[34]M.Kantardzic.DataMining:Concepts,Models,Methods,andAlgorithms.Wiley-IEEEPress,Piscataway,NJ,2003.

[35]H.KarguptaandP.K.Chan,editors.AdvancesinDistributedandParallelKnowledgeDiscovery.AAAIPress,September2002.

[36]H.Kargupta,S.Datta,Q.Wang,andK.Sivakumar.OnthePrivacyPreservingPropertiesofRandomDataPerturbationTechniques.InProc.ofthe2003IEEEIntl.Conf.onDataMining,pages99–106,Melbourne,Florida,December2003.IEEEComputerSociety.

[37]A.Karpatne,G.Atluri,J.Faghmous,M.Steinbach,A.Banerjee,A.Ganguly,S.Shekhar,N.Samatova,andV.Kumar.Theory-guidedDataScience:ANewParadigmforScientificDiscoveryfromData.IEEETransactionsonKnowledgeandDataEngineering,2017.

[38]D.Kifer,S.Ben-David,andJ.Gehrke.DetectingChangeinDataStreams.InProc.ofthe30thVLDBConf.,pages180–191,Toronto,Canada,2004.MorganKaufmann.

[39]J.Kleinberg,J.Ludwig,andS.Mullainathan.AGuidetoSolvingSocialProblemswithMachineLearning.HarvardBusinessReview,December2016.

[40]D.Lambert.WhatUseisStatisticsforMassiveData?InACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,pages54–62,2000.

[41]M.H.C.Law,N.Zhang,andA.K.Jain.NonlinearManifoldLearningforDataStreams.InProc.oftheSIAMIntl.Conf.onDataMining,LakeBuenaVista,Florida,April2004.SIAM.

[42]Z.C.Lipton.Themythosofmodelinterpretability.arXivpreprintarXiv:1606.03490,2016.

[43]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.

[44]S.Papadimitriou,A.Brockwell,andC.Faloutsos.Adaptive,unsupervisedstreammining.VLDBJournal,13(3):222–239,2004.

[45] O. Parr Rud. Data Mining Cookbook: Modeling Data for Marketing, Risk and Customer Relationship Management. John Wiley & Sons, New York, NY, 2001.

[46]F.Pedregosa,G.Varoquaux,A.Gramfort,V.Michel,B.Thirion,O.Grisel,M.Blondel,P.Prettenhofer,R.Weiss,V.Dubourg,J.Vanderplas,A.Passos,D.Cournapeau,M.Brucher,M.Perrot,andE.Duchesnay.Scikit-learn:MachineLearninginPython.JournalofMachineLearningResearch,12:2825–2830,2011.

[47]D.Pyle.BusinessModelingandDataMining.MorganKaufmann,SanFrancisco,CA,2003.

[48]N.RamakrishnanandA.Grama.DataMining:FromSerendipitytoScience—GuestEditors’Introduction.IEEEComputer,32(8):34–37,1999.

[49]M.T.Ribeiro,S.Singh,andC.Guestrin.Whyshoulditrustyou?:Explainingthepredictionsofanyclassifier.InProceedingsofthe22ndACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages1135–1144.ACM,2016.

[50]R.RoigerandM.Geatz.DataMining:ATutorialBasedPrimer.Addison-Wesley,2002.

[51]J.Schafer.TheApplicationofData-MiningtoRecommenderSystems.Encyclopediaofdatawarehousingandmining,1:44–48,2009.

[52]P.Smyth.BreakingoutoftheBlack-Box:ResearchChallengesinDataMining.InProc.ofthe2001ACMSIGMODWorkshoponResearchIssuesinDataMiningandKnowledgeDiscovery,2001.

[53]P.Smyth,D.Pregibon,andC.Faloutsos.Data-drivenevolutionofdataminingalgorithms.CommunicationsoftheACM,45(8):33–37,2002.

[54]X.SuandT.M.Khoshgoftaar.Asurveyofcollaborativefilteringtechniques.Advancesinartificialintelligence,2009:4,2009.

[55]V.S.Verykios,E.Bertino,I.N.Fovino,L.P.Provenza,Y.Saygin,andY.Theodoridis.State-of-the-artinprivacypreservingdatamining.SIGMODRecord,33(1):50–57,2004.

[56]J.T.L.Wang,M.J.Zaki,H.Toivonen,andD.E.Shasha,editors.DataMininginBioinformatics.Springer,September2004.

[57] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons, 2nd edition, 2002.

[58]I.H.WittenandE.Frank.DataMining:PracticalMachineLearningToolsandTechniques.MorganKaufmann,3rdedition,2011.

[59]X.Wu,P.S.Yu,andG.Piatetsky-Shapiro.DataMining:HowResearchMeetsPracticalDevelopment?KnowledgeandInformationSystems,5(2):248–261,2003.

[60]M.J.ZakiandC.-T.Ho,editors.Large-ScaleParallelDataMining.Springer,September2002.

[61]M.J.ZakiandW.MeiraJr.DataMiningandAnalysis:FundamentalConceptsandAlgorithms.CambridgeUniversityPress,NewYork,2014.

1.7 Exercises

1. Discuss whether or not each of the following activities is a data mining task.

a. Dividing the customers of a company according to their gender.

b. Dividing the customers of a company according to their profitability.

c. Computing the total sales of a company.

d. Sorting a student database based on student identification numbers.

e. Predicting the outcomes of tossing a (fair) pair of dice.

f. Predicting the future stock price of a company using historical records.

g. Monitoring the heart rate of a patient for abnormalities.

h. Monitoring seismic waves for earthquake activities.

i. Extracting the frequencies of a sound wave.

2. Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of how techniques, such as clustering, classification, association rule mining, and anomaly detection, can be applied.

3. For each of the following data sets, explain whether or not data privacy is an important issue.

a. Census data collected from 1900–1950.

b. IP addresses and visit times of web users who visit your website.

c. Images from Earth-orbiting satellites.

d. Names and addresses of people from the telephone book.

e. Names and email addresses collected from the Web.

2 Data

Thischapterdiscussesseveraldata-relatedissuesthatareimportantforsuccessfuldatamining:

TheTypeofDataDatasetsdifferinanumberofways.Forexample,theattributesusedtodescribedataobjectscanbeofdifferenttypes—quantitativeorqualitative—anddatasetsoftenhavespecialcharacteristics;e.g.,somedatasetscontaintimeseriesorobjectswithexplicitrelationshipstooneanother.Notsurprisingly,thetypeofdatadetermineswhichtoolsandtechniquescanbeusedtoanalyzethedata.Indeed,newresearchindataminingisoftendrivenbytheneedtoaccommodatenewapplicationareasandtheirnewtypesofdata.

TheQualityoftheDataDataisoftenfarfromperfect.Whilemostdataminingtechniquescantoleratesomelevelofimperfectioninthedata,afocusonunderstandingandimprovingdataqualitytypicallyimprovesthequalityoftheresultinganalysis.Dataqualityissuesthatoftenneedtobeaddressedincludethepresenceofnoiseandoutliers;missing,inconsistent,orduplicatedata;anddatathatisbiasedor,insomeotherway,unrepresentativeofthephenomenonorpopulationthatthedataissupposedtodescribe.

Preprocessing Steps to Make the Data More Suitable for Data Mining
Often, the raw data must be processed in order to make it suitable for analysis. While one objective may be to improve data quality, other goals focus on modifying the data so that it better fits a specified data mining technique or tool. For example, a continuous attribute, e.g., length, sometimes needs to be transformed into an attribute with discrete categories, e.g., short, medium, or long, in order to apply a particular technique (a short sketch of this appears after this overview). As another example, the number of attributes in a data set is often reduced because many techniques are more effective when the data has a relatively small number of attributes.

AnalyzingDatainTermsofItsRelationshipsOneapproachtodataanalysisistofindrelationshipsamongthedataobjectsandthenperformtheremaininganalysisusingtheserelationshipsratherthanthedataobjectsthemselves.Forinstance,wecancomputethesimilarityordistancebetweenpairsofobjectsandthenperformtheanalysis—clustering,classification,oranomalydetection—basedonthesesimilaritiesordistances.Therearemanysuchsimilarityordistancemeasures,andtheproperchoicedependsonthetypeofdataandtheparticularapplication.
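As a minimal illustration of the discretization mentioned in the preprocessing item above, the following Python sketch bins a continuous length attribute into the categories short, medium, and long. The cut points (1.0 and 3.0 meters) are arbitrary values chosen for the example, not thresholds prescribed by the text.

    def discretize_length(length_m, short_max=1.0, medium_max=3.0):
        """Map a continuous length (in meters) to a discrete category.

        The cut points short_max and medium_max are illustrative only;
        in practice they would be chosen from the data or the domain.
        """
        if length_m <= short_max:
            return "short"
        elif length_m <= medium_max:
            return "medium"
        return "long"

    lengths = [0.4, 1.5, 2.9, 7.2]
    print([discretize_length(x) for x in lengths])
    # ['short', 'medium', 'medium', 'long']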

Example2.1(AnIllustrationofData-RelatedIssues).Tofurtherillustratetheimportanceoftheseissues,considerthefollowinghypotheticalsituation.Youreceiveanemailfromamedicalresearcherconcerningaprojectthatyouareeagertoworkon.

Hi,

I've attached the data file that I mentioned in my previous email. Each line contains the information for a single patient and consists of five fields. We want to predict the last field using the other fields. I don't have time to provide any more information about the data since I'm going out of town for a couple of days, but hopefully that won't slow you down too much. And if you don't mind, could we meet when I get back to discuss your preliminary results? I might invite a few other members of my team.

Thanks and see you in a couple of days.

Despitesomemisgivings,youproceedtoanalyzethedata.Thefirstfewrowsofthefileareasfollows:

012 232 33.5 0 10.7

020 121 16.9 2 210.1

027 165 24.0 0 427.6

Abrieflookatthedatarevealsnothingstrange.Youputyourdoubtsasideandstarttheanalysis.Thereareonly1000lines,asmallerdatafilethanyouhadhopedfor,buttwodayslater,youfeelthatyouhavemadesomeprogress.Youarriveforthemeeting,andwhilewaitingforotherstoarrive,youstrikeupaconversationwithastatisticianwhoisworkingontheproject.Whenshelearnsthatyouhavealsobeenanalyzingthedatafromtheproject,sheasksifyouwouldmindgivingherabriefoverviewofyourresults.

Statistician:So,yougotthedataforallthepatients?

DataMiner:Yes.Ihaven’thadmuchtimeforanalysis,butIdohaveafewinterestingresults.

Statistician:Amazing.ThereweresomanydataissueswiththissetofpatientsthatIcouldn’tdomuch.

DataMiner:Oh?Ididn’thearaboutanypossibleproblems.

Statistician:Well,firstthereisfield5,thevariablewewanttopredict.

It’scommonknowledgeamongpeoplewhoanalyzethistypeofdatathatresultsarebetterifyouworkwiththelogofthevalues,butIdidn’tdiscoverthisuntillater.Wasitmentionedtoyou?

DataMiner:No.

Statistician:Butsurelyyouheardaboutwhathappenedtofield4?It’ssupposedtobemeasuredonascalefrom1to10,with0indicatingamissingvalue,butbecauseofadataentryerror,all10’swerechangedinto0’s.Unfortunately,sincesomeofthepatientshavemissingvaluesforthisfield,it’simpossibletosaywhethera0inthisfieldisareal0ora10.Quiteafewoftherecordshavethatproblem.

DataMiner:Interesting.Werethereanyotherproblems?

Statistician:Yes,fields2and3arebasicallythesame,butIassumethatyouprobablynoticedthat.

DataMiner:Yes,butthesefieldswereonlyweakpredictorsoffield5.

Statistician:Anyway,givenallthoseproblems,I’msurprisedyouwereabletoaccomplishanything.

DataMiner:True,butmyresultsarereallyquitegood.Field1isaverystrongpredictoroffield5.I’msurprisedthatthiswasn’tnoticedbefore.

Statistician:What?Field1isjustanidentificationnumber.

DataMiner:Nonetheless,myresultsspeakforthemselves.

Statistician:Oh,no!Ijustremembered.WeassignedIDnumbersafterwesortedtherecordsbasedonfield5.Thereisastrongconnection,butit’smeaningless.Sorry.

Although this scenario represents an extreme situation, it emphasizes the importance of "knowing your data." To that end, this chapter will address each of the four issues mentioned above, outlining some of the basic challenges and standard approaches.

2.1 Types of Data

A data set can often be viewed as a collection of data objects. Other names for a data object are record, point, vector, pattern, event, case, sample, instance, observation, or entity. In turn, data objects are described by a number of attributes that capture the characteristics of an object, such as the mass of a physical object or the time at which an event occurred. Other names for an attribute are variable, characteristic, field, feature, or dimension.

Example2.2(StudentInformation).Often,adatasetisafile,inwhichtheobjectsarerecords(orrows)inthefileandeachfield(orcolumn)correspondstoanattribute.Forexample,Table2.1 showsadatasetthatconsistsofstudentinformation.Eachrowcorrespondstoastudentandeachcolumnisanattributethatdescribessomeaspectofastudent,suchasgradepointaverage(GPA)oridentificationnumber(ID).

Table2.1.Asampledatasetcontainingstudentinformation.

StudentID Year GradePointAverage(GPA) …

1034262 Senior 3.24 …

1052663 Freshman 3.51 …

1082246 Sophomore 3.62 …

Althoughrecord-baseddatasetsarecommon,eitherinflatfilesorrelationaldatabasesystems,thereareotherimportanttypesofdatasetsandsystemsforstoringdata.InSection2.1.2 ,wewilldiscusssomeofthetypesofdatasetsthatarecommonlyencounteredindatamining.However,wefirstconsiderattributes.

2.1.1AttributesandMeasurement

Inthissection,weconsiderthetypesofattributesusedtodescribedataobjects.Wefirstdefineanattribute,thenconsiderwhatwemeanbythetypeofanattribute,andfinallydescribethetypesofattributesthatarecommonlyencountered.

WhatIsanAttribute?Westartwithamoredetaileddefinitionofanattribute.

Definition2.1.Anattributeisapropertyorcharacteristicofanobjectthatcanvary,eitherfromoneobjecttoanotherorfromonetimetoanother.

Forexample,eyecolorvariesfrompersontoperson,whilethetemperatureofanobjectvariesovertime.Notethateyecolorisasymbolicattributewitha

smallnumberofpossiblevalues{brown,black,blue,green,hazel,etc.},whiletemperatureisanumericalattributewithapotentiallyunlimitednumberofvalues.

Atthemostbasiclevel,attributesarenotaboutnumbersorsymbols.However,todiscussandmorepreciselyanalyzethecharacteristicsofobjects,weassignnumbersorsymbolstothem.Todothisinawell-definedway,weneedameasurementscale.

Definition2.2.Ameasurementscaleisarule(function)thatassociatesanumericalorsymbolicvaluewithanattributeofanobject.

Formally,theprocessofmeasurementistheapplicationofameasurementscaletoassociateavaluewithaparticularattributeofaspecificobject.Whilethismayseemabitabstract,weengageintheprocessofmeasurementallthetime.Forinstance,westeponabathroomscaletodetermineourweight,weclassifysomeoneasmaleorfemale,orwecountthenumberofchairsinaroomtoseeiftherewillbeenoughtoseatallthepeoplecomingtoameeting.Inallthesecases,the“physicalvalue”ofanattributeofanobjectismappedtoanumericalorsymbolicvalue.

Withthisbackground,wecandiscussthetypeofanattribute,aconceptthatisimportantindeterminingifaparticulardataanalysistechniqueisconsistentwithaspecifictypeofattribute.

TheTypeofanAttributeItiscommontorefertothetypeofanattributeasthetypeofameasurementscale.Itshouldbeapparentfromthepreviousdiscussionthatanattributecanbedescribedusingdifferentmeasurementscalesandthatthepropertiesofanattributeneednotbethesameasthepropertiesofthevaluesusedtomeasureit.Inotherwords,thevaluesusedtorepresentanattributecanhavepropertiesthatarenotpropertiesoftheattributeitself,andviceversa.Thisisillustratedwithtwoexamples.

Example2.3(EmployeeAgeandIDNumber).TwoattributesthatmightbeassociatedwithanemployeeareIDandage(inyears).Bothoftheseattributescanberepresentedasintegers.However,whileitisreasonabletotalkabouttheaverageageofanemployee,itmakesnosensetotalkabouttheaverageemployeeID.Indeed,theonlyaspectofemployeesthatwewanttocapturewiththeIDattributeisthattheyaredistinct.Consequently,theonlyvalidoperationforemployeeIDsistotestwhethertheyareequal.Thereisnohintofthislimitation,however,whenintegersareusedtorepresenttheemployeeIDattribute.Fortheageattribute,thepropertiesoftheintegersusedtorepresentageareverymuchthepropertiesoftheattribute.Evenso,thecorrespondenceisnotcompletebecause,forexample,ageshaveamaximum,whileintegersdonot.

Example2.4(LengthofLineSegments).ConsiderFigure2.1 ,whichshowssomeobjects—linesegments—andhowthelengthattributeoftheseobjectscanbemappedtonumbersintwodifferentways.Eachsuccessivelinesegment,goingfromthetoptothebottom,isformedbyappendingthetopmostlinesegmenttoitself.Thus,

thesecondlinesegmentfromthetopisformedbyappendingthetopmostlinesegmenttoitselftwice,thethirdlinesegmentfromthetopisformedbyappendingthetopmostlinesegmenttoitselfthreetimes,andsoforth.Inaveryreal(physical)sense,allthelinesegmentsaremultiplesofthefirst.Thisfactiscapturedbythemeasurementsontherightsideofthefigure,butnotbythoseontheleftside.Morespecifically,themeasurementscaleontheleftsidecapturesonlytheorderingofthelengthattribute,whilethescaleontherightsidecapturesboththeorderingandadditivityproperties.Thus,anattributecanbemeasuredinawaythatdoesnotcaptureallthepropertiesoftheattribute.

Figure2.1.Themeasurementofthelengthoflinesegmentsontwodifferentscalesofmeasurement.

Knowingthetypeofanattributeisimportantbecauseittellsuswhichpropertiesofthemeasuredvaluesareconsistentwiththeunderlying

propertiesoftheattribute,andtherefore,itallowsustoavoidfoolishactions,suchascomputingtheaverageemployeeID.

TheDifferentTypesofAttributesAuseful(andsimple)waytospecifythetypeofanattributeistoidentifythepropertiesofnumbersthatcorrespondtounderlyingpropertiesoftheattribute.Forexample,anattributesuchaslengthhasmanyofthepropertiesofnumbers.Itmakessensetocompareandorderobjectsbylength,aswellastotalkaboutthedifferencesandratiosoflength.Thefollowingproperties(operations)ofnumbersaretypicallyusedtodescribeattributes.

1. Distinctness: = and ≠
2. Order: <, ≤, >, and ≥
3. Addition: + and −
4. Multiplication: × and /

Giventheseproperties,wecandefinefourtypesofattributes:nominal,ordinal,interval,andratio.Table2.2 givesthedefinitionsofthesetypes,alongwithinformationaboutthestatisticaloperationsthatarevalidforeachtype.Eachattributetypepossessesallofthepropertiesandoperationsoftheattributetypesaboveit.Consequently,anypropertyoroperationthatisvalidfornominal,ordinal,andintervalattributesisalsovalidforratioattributes.Inotherwords,thedefinitionoftheattributetypesiscumulative.However,thisdoesnotmeanthatthestatisticaloperationsappropriateforoneattributetypeareappropriatefortheattributetypesaboveit.

Table 2.2. Different attribute types.

Attribute Type | Description | Examples | Operations

Categorical (Qualitative)
Nominal | The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, gender | mode, entropy, contingency correlation, χ² test

Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative)
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests

Ratio | For ratio variables, both differences and ratios are meaningful. (×, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation

Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes. As the name suggests, qualitative attributes, such as employee ID, lack most of the properties of numbers. Even if they are represented by numbers, i.e., integers, they should be treated more like symbols. The remaining two types of attributes, interval and ratio, are collectively referred to as quantitative or numeric attributes. Quantitative attributes are represented by numbers and have most of the properties of numbers. Note that quantitative attributes can be integer-valued or continuous.

Thetypesofattributescanalsobedescribedintermsoftransformationsthatdonotchangethemeaningofanattribute.Indeed,S.SmithStevens,thepsychologistwhooriginallydefinedthetypesofattributesshowninTable2.2 ,definedthemintermsofthesepermissibletransformations.Forexample,themeaningofalengthattributeisunchangedifitismeasuredinmetersinsteadoffeet.

Thestatisticaloperationsthatmakesenseforaparticulartypeofattributearethosethatwillyieldthesameresultswhentheattributeistransformedbyusingatransformationthatpreservestheattribute’smeaning.Toillustrate,theaveragelengthofasetofobjectsisdifferentwhenmeasuredinmetersratherthaninfeet,butbothaveragesrepresentthesamelength.Table2.3 showsthemeaning-preservingtransformationsforthefourattributetypesofTable2.2 .

Table 2.3. Transformations that define attribute levels.

Attribute Type | Transformation | Comment

Categorical (Qualitative)
Nominal | Any one-to-one mapping, e.g., a permutation of values | If all employee ID numbers are reassigned, it will not make any difference.

Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative)
Interval | new_value = a × old_value + b, where a and b are constants. | The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).

Ratio | new_value = a × old_value | Length can be measured in meters or feet.

Example 2.5 (Temperature Scales). Temperature provides a good illustration of some of the concepts that have been described. First, temperature can be either an interval or a ratio attribute, depending on its measurement scale. When measured on the Kelvin scale, a temperature of 2° is, in a physically meaningful way, twice that of a temperature of 1°. This is not true when temperature is measured on either the Celsius or Fahrenheit scales, because, physically, a temperature of 1° Fahrenheit (Celsius) is not much different than a temperature of 2° Fahrenheit (Celsius). The problem is that the zero points of the Fahrenheit and Celsius scales are, in a physical sense, arbitrary, and therefore, the ratio of two Celsius or Fahrenheit temperatures is not physically meaningful.
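The following sketch (not from the text) makes the same point numerically: the ratio of two temperatures is preserved under a rescaling of the Kelvin scale, but the same two temperatures expressed in Fahrenheit give a completely different ratio, which is why ratio statements are only meaningful on a ratio scale.

    def celsius_to_fahrenheit(c):
        # Interval-scale transformation: new_value = a * old_value + b
        return 1.8 * c + 32.0

    def celsius_to_kelvin(c):
        # Shift to the physically meaningful zero point
        return c + 273.15

    t1_c, t2_c = 1.0, 2.0                    # 1 and 2 degrees Celsius
    print(t2_c / t1_c)                       # 2.0, but this ratio is not meaningful
    print(celsius_to_fahrenheit(t2_c) / celsius_to_fahrenheit(t1_c))  # ~1.05, a different number
    # On the Kelvin (ratio) scale, ratios survive a change of unit (new_value = a * old_value):
    t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)
    print(t2_k / t1_k)                       # ~1.0036
    print((0.5 * t2_k) / (0.5 * t1_k))       # same ~1.0036, unchanged by rescaling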

DescribingAttributesbytheNumberofValuesAnindependentwayofdistinguishingbetweenattributesisbythenumberofvaluestheycantake.

DiscreteAdiscreteattributehasafiniteorcountablyinfinitesetofvalues.Suchattributescanbecategorical,suchaszipcodesorIDnumbers,ornumeric,suchascounts.Discreteattributesareoftenrepresentedusingintegervariables.Binaryattributesareaspecialcaseofdiscreteattributesandassumeonlytwovalues,e.g.,true/false,yes/no,male/female,or0/1.


BinaryattributesareoftenrepresentedasBooleanvariables,orasintegervariablesthatonlytakethevalues0or1.

ContinuousAcontinuousattributeisonewhosevaluesarerealnumbers.Examplesincludeattributessuchastemperature,height,orweight.Continuousattributesaretypicallyrepresentedasfloating-pointvariables.Practically,realvaluescanbemeasuredandrepresentedonlywithlimitedprecision.

Intheory,anyofthemeasurementscaletypes—nominal,ordinal,interval,andratio—couldbecombinedwithanyofthetypesbasedonthenumberofattributevalues—binary,discrete,andcontinuous.However,somecombinationsoccuronlyinfrequentlyordonotmakemuchsense.Forinstance,itisdifficulttothinkofarealisticdatasetthatcontainsacontinuousbinaryattribute.Typically,nominalandordinalattributesarebinaryordiscrete,whileintervalandratioattributesarecontinuous.However,countattributes,whicharediscrete,arealsoratioattributes.

AsymmetricAttributesForasymmetricattributes,onlypresence—anon-zeroattributevalue—isregardedasimportant.Consideradatasetinwhicheachobjectisastudentandeachattributerecordswhetherastudenttookaparticularcourseatauniversity.Foraspecificstudent,anattributehasavalueof1ifthestudenttookthecourseassociatedwiththatattributeandavalueof0otherwise.Becausestudentstakeonlyasmallfractionofallavailablecourses,mostofthevaluesinsuchadatasetwouldbe0.Therefore,itismoremeaningfulandmoreefficienttofocusonthenon-zerovalues.Toillustrate,ifstudentsarecomparedonthebasisofthecoursestheydon’ttake,thenmoststudentswouldseemverysimilar,atleastifthenumberofcoursesislarge.Binaryattributeswhereonlynon-zerovaluesareimportantarecalledasymmetric

binaryattributes.Thistypeofattributeisparticularlyimportantforassociationanalysis,whichisdiscussedinChapter5 .Itisalsopossibletohavediscreteorcontinuousasymmetricfeatures.Forinstance,ifthenumberofcreditsassociatedwitheachcourseisrecorded,thentheresultingdatasetwillconsistofasymmetricdiscreteorcontinuousattributes.
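As a minimal sketch of why only the non-zero entries matter for asymmetric binary attributes, the following Python fragment compares two hypothetical students first over all courses (where agreement is dominated by the courses neither student took) and then only over the courses at least one of them actually took. The catalog size and enrollments are made up for illustration.

    # Each student is stored sparsely: only the courses actually taken (the 1s).
    catalog_size = 1000                      # assumed total number of available courses
    student_a = {"CS101", "CS201", "MATH140"}
    student_b = {"CS101", "STAT250"}

    taken_by_either = student_a | student_b
    agree_on_taken = len(student_a & student_b)

    # Counting agreement over every course, the shared 0s swamp everything:
    agree_all = catalog_size - len(taken_by_either) + agree_on_taken
    print(agree_all / catalog_size)          # ~0.997 -- almost any two students look alike
    # Restricting attention to non-zero entries gives a more meaningful comparison:
    print(agree_on_taken / len(taken_by_either))   # 0.25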

GeneralCommentsonLevelsofMeasurementAsdescribedintherestofthischapter,therearemanydiversetypesofdata.Thepreviousdiscussionofmeasurementscales,whileuseful,isnotcompleteandhassomelimitations.Weprovidethefollowingcommentsandguidance.

Distinctness,order,andmeaningfulintervalsandratiosareonlyfourpropertiesofdata—manyothersarepossible.Forinstance,somedataisinherentlycyclical,e.g.,positiononthesurfaceoftheEarthortime.Asanotherexample,considersetvaluedattributes,whereeachattributevalueisasetofelements,e.g.,thesetofmoviesseeninthelastyear.Defineonesetofelements(movies)tobegreater(larger)thanasecondsetifthesecondsetisasubsetofthefirst.However,sucharelationshipdefinesonlyapartialorderthatdoesnotmatchanyoftheattributetypesjustdefined.Thenumbersorsymbolsusedtocaptureattributevaluesmaynotcaptureallthepropertiesoftheattributesormaysuggestpropertiesthatarenotthere.AnillustrationofthisforintegerswaspresentedinExample2.3 ,i.e.,averagesofIDsandoutofrangeages.Dataisoftentransformedforthepurposeofanalysis—seeSection2.3.7 .Thisoftenchangesthedistributionoftheobservedvariabletoadistributionthatiseasiertoanalyze,e.g.,aGaussian(normal)distribution.Often,suchtransformationsonlypreservetheorderoftheoriginalvalues,andotherpropertiesarelost.Nonetheless,ifthedesiredoutcomeisa

statisticaltestofdifferencesorapredictivemodel,suchatransformationisjustified.Thefinalevaluationofanydataanalysis,includingoperationsonattributes,iswhethertheresultsmakesensefromadomainpointofview.

Insummary,itcanbechallengingtodeterminewhichoperationscanbeperformedonaparticularattributeoracollectionofattributeswithoutcompromisingtheintegrityoftheanalysis.Fortunately,establishedpracticeoftenservesasareliableguide.Occasionally,however,standardpracticesareerroneousorhavelimitations.

2.1.2TypesofDataSets

Therearemanytypesofdatasets,andasthefieldofdataminingdevelopsandmatures,agreatervarietyofdatasetsbecomeavailableforanalysis.Inthissection,wedescribesomeofthemostcommontypes.Forconvenience,wehavegroupedthetypesofdatasetsintothreegroups:recorddata,graph-baseddata,andordereddata.Thesecategoriesdonotcoverallpossibilitiesandothergroupingsarecertainlypossible.

GeneralCharacteristicsofDataSetsBeforeprovidingdetailsofspecifickindsofdatasets,wediscussthreecharacteristicsthatapplytomanydatasetsandhaveasignificantimpactonthedataminingtechniquesthatareused:dimensionality,distribution,andresolution.

Dimensionality

Thedimensionalityofadatasetisthenumberofattributesthattheobjectsinthedatasetpossess.Analyzingdatawithasmallnumberofdimensionstendstobequalitativelydifferentfromanalyzingmoderateorhigh-dimensionaldata.Indeed,thedifficultiesassociatedwiththeanalysisofhigh-dimensionaldataaresometimesreferredtoasthecurseofdimensionality.Becauseofthis,animportantmotivationinpreprocessingthedataisdimensionalityreduction.TheseissuesarediscussedinmoredepthlaterinthischapterandinAppendixB.

Distribution

Thedistributionofadatasetisthefrequencyofoccurrenceofvariousvaluesorsetsofvaluesfortheattributescomprisingdataobjects.Equivalently,thedistributionofadatasetcanbeconsideredasadescriptionoftheconcentrationofobjectsinvariousregionsofthedataspace.Statisticianshaveenumeratedmanytypesofdistributions,e.g.,Gaussian(normal),anddescribedtheirproperties.(SeeAppendixC.)Althoughstatisticalapproachesfordescribingdistributionscanyieldpowerfulanalysistechniques,manydatasetshavedistributionsthatarenotwellcapturedbystandardstatisticaldistributions.

Asaresult,manydataminingalgorithmsdonotassumeaparticularstatisticaldistributionforthedatatheyanalyze.However,somegeneralaspectsofdistributionsoftenhaveastrongimpact.Forexample,supposeacategoricalattributeisusedasaclassvariable,whereoneofthecategoriesoccurs95%ofthetime,whiletheothercategoriestogetheroccuronly5%ofthetime.ThisskewnessinthedistributioncanmakeclassificationdifficultasdiscussedinSection4.11.(Skewnesshasotherimpactsondataanalysisthatarenotdiscussedhere.)

Aspecialcaseofskeweddataissparsity.Forsparsebinary,countorcontinuousdata,mostattributesofanobjecthavevaluesof0.Inmanycases,fewerthan1%ofthevaluesarenon-zero.Inpracticalterms,sparsityisanadvantagebecauseusuallyonlythenon-zerovaluesneedtobestoredandmanipulated.Thisresultsinsignificantsavingswithrespecttocomputationtimeandstorage.Indeed,somedataminingalgorithms,suchastheassociationruleminingalgorithmsdescribedinChapter5 ,workwellonlyforsparsedata.Finally,notethatoftentheattributesinsparsedatasetsareasymmetricattributes.

Resolution

Itisfrequentlypossibletoobtaindataatdifferentlevelsofresolution,andoftenthepropertiesofthedataaredifferentatdifferentresolutions.Forinstance,thesurfaceoftheEarthseemsveryunevenataresolutionofafewmeters,butisrelativelysmoothataresolutionoftensofkilometers.Thepatternsinthedataalsodependonthelevelofresolution.Iftheresolutionistoofine,apatternmaynotbevisibleormaybeburiedinnoise;iftheresolutionistoocoarse,thepatterncandisappear.Forexample,variationsinatmosphericpressureonascaleofhoursreflectthemovementofstormsandotherweathersystems.Onascaleofmonths,suchphenomenaarenotdetectable.

RecordDataMuchdataminingworkassumesthatthedatasetisacollectionofrecords(dataobjects),eachofwhichconsistsofafixedsetofdatafields(attributes).SeeFigure2.2(a) .Forthemostbasicformofrecorddata,thereisnoexplicitrelationshipamongrecordsordatafields,andeveryrecord(object)hasthesamesetofattributes.Recorddataisusuallystoredeitherinflatfilesorinrelationaldatabases.Relationaldatabasesarecertainlymorethana

collectionofrecords,butdataminingoftendoesnotuseanyoftheadditionalinformationavailableinarelationaldatabase.Rather,thedatabaseservesasaconvenientplacetofindrecords.DifferenttypesofrecorddataaredescribedbelowandareillustratedinFigure2.2 .

Figure2.2.Differentvariationsofrecorddata.

TransactionorMarketBasketData

Transactiondataisaspecialtypeofrecorddata,whereeachrecord(transaction)involvesasetofitems.Consideragrocerystore.Thesetofproductspurchasedbyacustomerduringoneshoppingtripconstitutesatransaction,whiletheindividualproductsthatwerepurchasedaretheitems.Thistypeofdataiscalledmarketbasketdatabecausetheitemsineachrecordaretheproductsinaperson’s“marketbasket.”Transactiondataisacollectionofsetsofitems,butitcanbeviewedasasetofrecordswhosefieldsareasymmetricattributes.Mostoften,theattributesarebinary,indicatingwhetheranitemwaspurchased,butmoregenerally,theattributescanbediscreteorcontinuous,suchasthenumberofitemspurchasedortheamountspentonthoseitems.Figure2.2(b) showsasampletransactiondataset.Eachrowrepresentsthepurchasesofaparticularcustomerataparticulartime.

TheDataMatrix

Ifallthedataobjectsinacollectionofdatahavethesamefixedsetofnumericattributes,thenthedataobjectscanbethoughtofaspoints(vectors)inamultidimensionalspace,whereeachdimensionrepresentsadistinctattributedescribingtheobject.Asetofsuchdataobjectscanbeinterpretedasanmbynmatrix,wheretherearemrows,oneforeachobject,andncolumns,oneforeachattribute.(Arepresentationthathasdataobjectsascolumnsandattributesasrowsisalsofine.)Thismatrixiscalledadatamatrixorapatternmatrix.Adatamatrixisavariationofrecorddata,butbecauseitconsistsofnumericattributes,standardmatrixoperationcanbeappliedtotransformandmanipulatethedata.Therefore,thedatamatrixisthestandarddataformatformoststatisticaldata.Figure2.2(c) showsasampledatamatrix.

TheSparseDataMatrix

Asparsedatamatrixisaspecialcaseofadatamatrixwheretheattributesareofthesametypeandareasymmetric;i.e.,onlynon-zerovaluesareimportant.Transactiondataisanexampleofasparsedatamatrixthathasonly0–1entries.Anothercommonexampleisdocumentdata.Inparticular,iftheorderoftheterms(words)inadocumentisignored—the“bagofwords”approach—thenadocumentcanberepresentedasatermvector,whereeachtermisacomponent(attribute)ofthevectorandthevalueofeachcomponentisthenumberoftimesthecorrespondingtermoccursinthedocument.Thisrepresentationofacollectionofdocumentsisoftencalledadocument-termmatrix.Figure2.2(d) showsasampledocument-termmatrix.Thedocumentsaretherowsofthismatrix,whilethetermsarethecolumns.Inpractice,onlythenon-zeroentriesofsparsedatamatricesarestored.
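A small sketch of the bag-of-words construction described above: each document becomes a row of term counts, and only the non-zero counts are stored. The two documents are invented for the example.

    from collections import Counter

    docs = [
        "data mining finds patterns in data",
        "graph mining finds frequent subgraphs",
    ]

    # Sparse document-term matrix: store only the non-zero (doc, term) counts.
    sparse_rows = [Counter(doc.split()) for doc in docs]
    vocabulary = sorted(set(term for row in sparse_rows for term in row))

    for doc_id, row in enumerate(sparse_rows):
        print(doc_id, dict(row))             # e.g., 0 {'data': 2, 'mining': 1, ...}

    # A dense row can be materialized when needed, with 0 for absent terms:
    dense_row_0 = [sparse_rows[0].get(term, 0) for term in vocabulary]
    print(vocabulary)
    print(dense_row_0)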

Graph-BasedDataAgraphcansometimesbeaconvenientandpowerfulrepresentationfordata.Weconsidertwospecificcases:(1)thegraphcapturesrelationshipsamongdataobjectsand(2)thedataobjectsthemselvesarerepresentedasgraphs.

DatawithRelationshipsamongObjects

Therelationshipsamongobjectsfrequentlyconveyimportantinformation.Insuchcases,thedataisoftenrepresentedasagraph.Inparticular,thedataobjectsaremappedtonodesofthegraph,whiletherelationshipsamongobjectsarecapturedbythelinksbetweenobjectsandlinkproperties,suchasdirectionandweight.ConsiderwebpagesontheWorldWideWeb,whichcontainbothtextandlinkstootherpages.Inordertoprocesssearchqueries,websearchenginescollectandprocesswebpagestoextracttheircontents.Itiswell-known,however,thatthelinkstoandfromeachpageprovideagreatdealofinformationabouttherelevanceofawebpagetoaquery,andthus,mustalsobetakenintoconsideration.Figure2.3(a) showsasetoflinked

webpages.Anotherimportantexampleofsuchgraphdataarethesocialnetworks,wheredataobjectsarepeopleandtherelationshipsamongthemaretheirinteractionsviasocialmedia.

DatawithObjectsThatAreGraphs

Ifobjectshavestructure,thatis,theobjectscontainsubobjectsthathaverelationships,thensuchobjectsarefrequentlyrepresentedasgraphs.Forexample,thestructureofchemicalcompoundscanberepresentedbyagraph,wherethenodesareatomsandthelinksbetweennodesarechemicalbonds.Figure2.3(b) showsaball-and-stickdiagramofthechemicalcompoundbenzene,whichcontainsatomsofcarbon(black)andhydrogen(gray).Agraphrepresentationmakesitpossibletodeterminewhichsubstructuresoccurfrequentlyinasetofcompoundsandtoascertainwhetherthepresenceofanyofthesesubstructuresisassociatedwiththepresenceorabsenceofcertainchemicalproperties,suchasmeltingpointorheatofformation.Frequentgraphmining,whichisabranchofdataminingthatanalyzessuchdata,isconsideredinSection6.5.
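As a minimal sketch of representing an object as a graph, the following fragment encodes a benzene-like ring of six carbon atoms, each bonded to one hydrogen atom, as a node list plus an edge (bond) list, and then computes the degree of each atom. Real frequent-subgraph mining systems use much richer representations; this only illustrates the data structure.

    # Nodes: atom id -> element symbol; edges: pairs of bonded atom ids.
    atoms = {i: "C" for i in range(6)}
    atoms.update({i + 6: "H" for i in range(6)})

    bonds = [(i, (i + 1) % 6) for i in range(6)]      # the carbon ring
    bonds += [(i, i + 6) for i in range(6)]           # one hydrogen attached to each carbon

    degree = {node: 0 for node in atoms}
    for u, v in bonds:
        degree[u] += 1
        degree[v] += 1

    print(degree)   # each carbon has degree 3, each hydrogen degree 1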

Figure2.3.Differentvariationsofgraphdata.

OrderedDataForsometypesofdata,theattributeshaverelationshipsthatinvolveorderintimeorspace.DifferenttypesofordereddataaredescribednextandareshowninFigure2.4 .

SequentialTransactionData

Sequentialtransactiondatacanbethoughtofasanextensionoftransactiondata,whereeachtransactionhasatimeassociatedwithit.Consideraretailtransactiondatasetthatalsostoresthetimeatwhichthetransactiontookplace.Thistimeinformationmakesitpossibletofindpatternssuchas“candysalespeakbeforeHalloween.”Atimecanalsobeassociatedwitheachattribute.Forexample,eachrecordcouldbethepurchasehistoryofa

customer,withalistingofitemspurchasedatdifferenttimes.Usingthisinformation,itispossibletofindpatternssuchas“peoplewhobuyDVDplayerstendtobuyDVDsintheperiodimmediatelyfollowingthepurchase.”

Figure2.4(a) showsanexampleofsequentialtransactiondata.Therearefivedifferenttimes—t1,t2,t3,t4,andt5;threedifferentcustomers—C1,C2,andC3;andfivedifferentitems—A,B,C,D,andE.Inthetoptable,eachrowcorrespondstotheitemspurchasedataparticulartimebyeachcustomer.Forinstance,attimet3,customerC2purchaseditemsAandD.Inthebottomtable,thesameinformationisdisplayed,buteachrowcorrespondstoaparticularcustomer.Eachrowcontainsinformationabouteachtransactioninvolvingthecustomer,whereatransactionisconsideredtobeasetofitemsandthetimeatwhichthoseitemswerepurchased.Forexample,customerC3boughtitemsAandCattimet2.

TimeSeriesData

Timeseriesdataisaspecialtypeofordereddatawhereeachrecordisatimeseries,i.e.,aseriesofmeasurementstakenovertime.Forexample,afinancialdatasetmightcontainobjectsthataretimeseriesofthedailypricesofvariousstocks.Asanotherexample,considerFigure2.4(c) ,whichshowsatimeseriesoftheaveragemonthlytemperatureforMinneapolisduringtheyears1982to1994.Whenworkingwithtemporaldata,suchastimeseries,itisimportanttoconsidertemporalautocorrelation;i.e.,iftwomeasurementsarecloseintime,thenthevaluesofthosemeasurementsareoftenverysimilar.
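To make the notion of temporal autocorrelation concrete, the following sketch (using a synthetic, seasonal-looking series rather than the Minneapolis data) computes the lag-1 autocorrelation of a smoothly varying series and of a randomly shuffled copy of it; values that are close in time are similar, which shows up as a high autocorrelation.

    import math
    import random

    def lag1_autocorrelation(x):
        """Correlation between x[t] and x[t+1] (simple estimator)."""
        n = len(x)
        mean = sum(x) / n
        num = sum((x[t] - mean) * (x[t + 1] - mean) for t in range(n - 1))
        den = sum((v - mean) ** 2 for v in x)
        return num / den

    # A smooth series, e.g., monthly temperatures over ten years.
    series = [10 + 8 * math.sin(2 * math.pi * t / 12) for t in range(120)]
    shuffled = series[:]
    random.shuffle(shuffled)

    print(round(lag1_autocorrelation(series), 3))    # high (about 0.87): neighboring values are similar
    print(round(lag1_autocorrelation(shuffled), 3))  # near 0: the temporal order has been destroyed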

Figure2.4.Differentvariationsofordereddata.

SequenceData

Sequencedataconsistsofadatasetthatisasequenceofindividualentities,suchasasequenceofwordsorletters.Itisquitesimilartosequentialdata,exceptthattherearenotimestamps;instead,therearepositionsinanorderedsequence.Forexample,thegeneticinformationofplantsandanimalscanberepresentedintheformofsequencesofnucleotidesthatareknownasgenes.Manyoftheproblemsassociatedwithgeneticsequencedatainvolvepredictingsimilaritiesinthestructureandfunctionofgenesfromsimilaritiesinnucleotidesequences.Figure2.4(b) showsasectionofthehumangeneticcodeexpressedusingthefournucleotidesfromwhichallDNAisconstructed:A,T,G,andC.

SpatialandSpatio-TemporalData

Someobjectshavespatialattributes,suchaspositionsorareas,inadditiontoothertypesofattributes.Anexampleofspatialdataisweatherdata(precipitation,temperature,pressure)thatiscollectedforavarietyofgeographicallocations.Oftensuchmeasurementsarecollectedovertime,andthus,thedataconsistsoftimeseriesatvariouslocations.Inthatcase,werefertothedataasspatio-temporaldata.Althoughanalysiscanbeconductedseparatelyforeachspecifictimeorlocation,amorecompleteanalysisofspatio-temporaldatarequiresconsiderationofboththespatialandtemporalaspectsofthedata.

Animportantaspectofspatialdataisspatialautocorrelation;i.e.,objectsthatarephysicallyclosetendtobesimilarinotherwaysaswell.Thus,twopointsontheEarththatareclosetoeachotherusuallyhavesimilarvaluesfortemperatureandrainfall.Notethatspatialautocorrelationisanalogoustotemporalautocorrelation.

Importantexamplesofspatialandspatio-temporaldataarethescienceandengineeringdatasetsthataretheresultofmeasurementsormodeloutput

taken at regularly or irregularly distributed points on a two- or three-dimensional grid or mesh. For instance, Earth science data sets record the temperature or pressure measured at points (grid cells) on latitude–longitude spherical grids of various resolutions, e.g., 1° by 1°. See Figure 2.4(d). As another example, in the simulation of the flow of a gas, the speed and direction of flow at various instants in time can be recorded for each grid point in the simulation. A different type of spatio-temporal data arises from tracking the trajectories of objects, e.g., vehicles, in time and space.

HandlingNon-RecordDataMostdataminingalgorithmsaredesignedforrecorddataoritsvariations,suchastransactiondataanddatamatrices.Record-orientedtechniquescanbeappliedtonon-recorddatabyextractingfeaturesfromdataobjectsandusingthesefeaturestocreatearecordcorrespondingtoeachobject.Considerthechemicalstructuredatathatwasdescribedearlier.Givenasetofcommonsubstructures,eachcompoundcanberepresentedasarecordwithbinaryattributesthatindicatewhetheracompoundcontainsaspecificsubstructure.Sucharepresentationisactuallyatransactiondataset,wherethetransactionsarethecompoundsandtheitemsarethesubstructures.

Insomecases,itiseasytorepresentthedatainarecordformat,butthistypeofrepresentationdoesnotcapturealltheinformationinthedata.Considerspatio-temporaldataconsistingofatimeseriesfromeachpointonaspatialgrid.Thisdataisoftenstoredinadatamatrix,whereeachrowrepresentsalocationandeachcolumnrepresentsaparticularpointintime.However,sucharepresentationdoesnotexplicitlycapturethetimerelationshipsthatarepresentamongattributesandthespatialrelationshipsthatexistamongobjects.Thisdoesnotmeanthatsucharepresentationisinappropriate,butratherthattheserelationshipsmustbetakenintoconsiderationduringtheanalysis.Forexample,itwouldnotbeagoodideatouseadatamining


techniquethatignoresthetemporalautocorrelationoftheattributesorthespatialautocorrelationofthedataobjects,i.e.,thelocationsonthespatialgrid.

2.2 Data Quality

Data mining algorithms are often applied to data that was collected for another purpose, or for future, but unspecified, applications. For that reason, data mining cannot usually take advantage of the significant benefits of "addressing quality issues at the source." In contrast, much of statistics deals with the design of experiments or surveys that achieve a prespecified level of data quality. Because preventing data quality problems is typically not an option, data mining focuses on (1) the detection and correction of data quality problems and (2) the use of algorithms that can tolerate poor data quality. The first step, detection and correction, is often called data cleaning.

Thefollowingsectionsdiscussspecificaspectsofdataquality.Thefocusisonmeasurementanddatacollectionissues,althoughsomeapplication-relatedissuesarealsodiscussed.

2.2.1MeasurementandDataCollectionIssues

Itisunrealistictoexpectthatdatawillbeperfect.Theremaybeproblemsduetohumanerror,limitationsofmeasuringdevices,orflawsinthedatacollectionprocess.Valuesorevenentiredataobjectscanbemissing.Inothercases,therecanbespuriousorduplicateobjects;i.e.,multipledataobjectsthatallcorrespondtoasingle“real”object.Forexample,theremightbetwodifferentrecordsforapersonwhohasrecentlylivedattwodifferentaddresses.Evenif

allthedataispresentand“looksfine,”theremaybeinconsistencies—apersonhasaheightof2meters,butweighsonly2kilograms.

Inthenextfewsections,wefocusonaspectsofdataqualitythatarerelatedtodatameasurementandcollection.Webeginwithadefinitionofmeasurementanddatacollectionerrorsandthenconsideravarietyofproblemsthatinvolvemeasurementerror:noise,artifacts,bias,precision,andaccuracy.Weconcludebydiscussingdataqualityissuesthatinvolvebothmeasurementanddatacollectionproblems:outliers,missingandinconsistentvalues,andduplicatedata.

MeasurementandDataCollectionErrorsThetermmeasurementerrorreferstoanyproblemresultingfromthemeasurementprocess.Acommonproblemisthatthevaluerecordeddiffersfromthetruevaluetosomeextent.Forcontinuousattributes,thenumericaldifferenceofthemeasuredandtruevalueiscalledtheerror.Thetermdatacollectionerrorreferstoerrorssuchasomittingdataobjectsorattributevalues,orinappropriatelyincludingadataobject.Forexample,astudyofanimalsofacertainspeciesmightincludeanimalsofarelatedspeciesthataresimilarinappearancetothespeciesofinterest.Bothmeasurementerrorsanddatacollectionerrorscanbeeithersystematicorrandom.

Wewillonlyconsidergeneraltypesoferrors.Withinparticulardomains,certaintypesofdataerrorsarecommonplace,andwell-developedtechniquesoftenexistfordetectingand/orcorrectingtheseerrors.Forexample,keyboarderrorsarecommonwhendataisenteredmanually,andasaresult,manydataentryprogramshavetechniquesfordetectingand,withhumanintervention,correctingsucherrors.

Noise and Artifacts
Noise is the random component of a measurement error. It typically involves the distortion of a value or the addition of spurious objects. Figure 2.5 shows a time series before and after it has been disrupted by random noise. If a bit more noise were added to the time series, its shape would be lost. Figure 2.6 shows a set of data points before and after some noise points (indicated by '+'s) have been added. Notice that some of the noise points are intermixed with the non-noise points.

Figure 2.5. Noise in a time series context.

Figure 2.6. Noise in a spatial context.

Thetermnoiseisoftenusedinconnectionwithdatathathasaspatialortemporalcomponent.Insuchcases,techniquesfromsignalorimageprocessingcanfrequentlybeusedtoreducenoiseandthus,helptodiscoverpatterns(signals)thatmightbe“lostinthenoise.”Nonetheless,theeliminationofnoiseisfrequentlydifficult,andmuchworkindataminingfocusesondevisingrobustalgorithmsthatproduceacceptableresultsevenwhennoiseispresent.

Dataerrorscanbetheresultofamoredeterministicphenomenon,suchasastreakinthesameplaceonasetofphotographs.Suchdeterministicdistortionsofthedataareoftenreferredtoasartifacts.

Precision,Bias,andAccuracyInstatisticsandexperimentalscience,thequalityofthemeasurementprocessandtheresultingdataaremeasuredbyprecisionandbias.Weprovidethe

standarddefinitions,followedbyabriefdiscussion.Forthefollowingdefinitions,weassumethatwemakerepeatedmeasurementsofthesameunderlyingquantity.

Definition2.3(Precision).Theclosenessofrepeatedmeasurements(ofthesamequantity)tooneanother.

Definition2.4(Bias).Asystematicvariationofmeasurementsfromthequantitybeingmeasured.

Precision is often measured by the standard deviation of a set of values, while bias is measured by taking the difference between the mean of the set of values and the known value of the quantity being measured. Bias can be determined only for objects whose measured quantity is known by means external to the current situation. Suppose that we have a standard laboratory weight with a mass of 1g and want to assess the precision and bias of our new laboratory scale. We weigh the mass five times, and obtain the following five values: {1.015, 0.990, 1.013, 1.001, 0.986}. The mean of these values is 1.001, and hence, the bias is 0.001. The precision, as measured by the standard deviation, is 0.013.
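The following short Python check reproduces the numbers in this example, using the sample standard deviation as the measure of precision.

    from statistics import mean, stdev

    true_mass = 1.0                                   # known mass of the standard weight, in grams
    weighings = [1.015, 0.990, 1.013, 1.001, 0.986]

    bias = mean(weighings) - true_mass                # systematic deviation from the true value
    precision = stdev(weighings)                      # spread of the repeated measurements

    print(round(bias, 3))        # 0.001
    print(round(precision, 3))   # 0.013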

Itiscommontousethemoregeneralterm,accuracy,torefertothedegreeofmeasurementerrorindata.

Definition2.5(Accuracy)Theclosenessofmeasurementstothetruevalueofthequantitybeingmeasured.

Accuracydependsonprecisionandbias,butthereisnospecificformulaforaccuracyintermsofthesetwoquantities.

One important aspect of accuracy is the use of significant digits. The goal is to use only as many digits to represent the result of a measurement or calculation as are justified by the precision of the data. For example, if the length of an object is measured with a meter stick whose smallest markings are millimeters, then we should record the length only to the nearest millimeter. The precision of such a measurement would be ±0.5 mm. We do not review the details of working with significant digits because most readers will have encountered them in previous courses and they are covered in considerable depth in science, engineering, and statistics textbooks.

Issuessuchassignificantdigits,precision,bias,andaccuracyaresometimesoverlooked,buttheyareimportantfordataminingaswellasstatisticsandscience.Manytimes,datasetsdonotcomewithinformationaboutthe


precisionofthedata,andfurthermore,theprogramsusedforanalysisreturnresultswithoutanysuchinformation.Nonetheless,withoutsomeunderstandingoftheaccuracyofthedataandtheresults,ananalystrunstheriskofcommittingseriousdataanalysisblunders.

OutliersOutliersareeither(1)dataobjectsthat,insomesense,havecharacteristicsthataredifferentfrommostoftheotherdataobjectsinthedataset,or(2)valuesofanattributethatareunusualwithrespecttothetypicalvaluesforthatattribute.Alternatively,theycanbereferredtoasanomalousobjectsorvalues.Thereisconsiderableleewayinthedefinitionofanoutlier,andmanydifferentdefinitionshavebeenproposedbythestatisticsanddataminingcommunities.Furthermore,itisimportanttodistinguishbetweenthenotionsofnoiseandoutliers.Unlikenoise,outlierscanbelegitimatedataobjectsorvaluesthatweareinterestedindetecting.Forinstance,infraudandnetworkintrusiondetection,thegoalistofindunusualobjectsoreventsfromamongalargenumberofnormalones.Chapter9 discussesanomalydetectioninmoredetail.

MissingValuesItisnotunusualforanobjecttobemissingoneormoreattributevalues.Insomecases,theinformationwasnotcollected;e.g.,somepeopledeclinetogivetheirageorweight.Inothercases,someattributesarenotapplicabletoallobjects;e.g.,often,formshaveconditionalpartsthatarefilledoutonlywhenapersonanswersapreviousquestioninacertainway,butforsimplicity,allfieldsarestored.Regardless,missingvaluesshouldbetakenintoaccountduringthedataanalysis.

Thereareseveralstrategies(andvariationsonthesestrategies)fordealingwithmissingdata,eachofwhichisappropriateincertaincircumstances.Thesestrategiesarelistednext,alongwithanindicationoftheiradvantagesanddisadvantages.

EliminateDataObjectsorAttributes

Asimpleandeffectivestrategyistoeliminateobjectswithmissingvalues.However,evenapartiallyspecifieddataobjectcontainssomeinformation,andifmanyobjectshavemissingvalues,thenareliableanalysiscanbedifficultorimpossible.Nonetheless,ifadatasethasonlyafewobjectsthathavemissingvalues,thenitmaybeexpedienttoomitthem.Arelatedstrategyistoeliminateattributesthathavemissingvalues.Thisshouldbedonewithcaution,however,becausetheeliminatedattributesmaybetheonesthatarecriticaltotheanalysis.

EstimateMissingValues

Sometimesmissingdatacanbereliablyestimated.Forexample,consideratimeseriesthatchangesinareasonablysmoothfashion,buthasafew,widelyscatteredmissingvalues.Insuchcases,themissingvaluescanbeestimated(interpolated)byusingtheremainingvalues.Asanotherexample,consideradatasetthathasmanysimilardatapoints.Inthissituation,theattributevaluesofthepointsclosesttothepointwiththemissingvalueareoftenusedtoestimatethemissingvalue.Iftheattributeiscontinuous,thentheaverageattributevalueofthenearestneighborsisused;iftheattributeiscategorical,thenthemostcommonlyoccurringattributevaluecanbetaken.Foraconcreteillustration,considerprecipitationmeasurementsthatarerecordedbygroundstations.Forareasnotcontainingagroundstation,theprecipitationcanbeestimatedusingvaluesobservedatnearbygroundstations.

IgnoretheMissingValueduringAnalysis

Manydataminingapproachescanbemodifiedtoignoremissingvalues.Forexample,supposethatobjectsarebeingclusteredandthesimilaritybetweenpairsofdataobjectsneedstobecalculated.Ifoneorbothobjectsofapairhavemissingvaluesforsomeattributes,thenthesimilaritycanbecalculatedbyusingonlytheattributesthatdonothavemissingvalues.Itistruethatthesimilaritywillonlybeapproximate,butunlessthetotalnumberofattributesissmallorthenumberofmissingvaluesishigh,thisdegreeofinaccuracymaynotmattermuch.Likewise,manyclassificationschemescanbemodifiedtoworkwithmissingvalues.
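As a minimal sketch of the "ignore during analysis" strategy, the following function computes a Euclidean-style distance between two objects using only the attributes for which both have values (None marks a missing value); it then rescales by the fraction of attributes used so that distances computed over different numbers of attributes remain roughly comparable. The rescaling convention is one common choice, not something prescribed by the text.

    import math

    def distance_ignoring_missing(x, y):
        """Euclidean distance over the attributes present in both objects (None = missing)."""
        shared = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
        if not shared:
            return None                      # nothing to compare
        d = math.sqrt(sum((a - b) ** 2 for a, b in shared))
        # Optional rescaling so that using fewer attributes does not shrink the distance.
        return d * math.sqrt(len(x) / len(shared))

    patient_1 = [120.0, 232.0, 33.5, None]
    patient_2 = [121.0, None, 30.0, 2.0]
    print(round(distance_ignoring_missing(patient_1, patient_2), 3))   # about 5.15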

InconsistentValuesDatacancontaininconsistentvalues.Consideranaddressfield,wherebothazipcodeandcityarelisted,butthespecifiedzipcodeareaisnotcontainedinthatcity.Itispossiblethattheindividualenteringthisinformationtransposedtwodigits,orperhapsadigitwasmisreadwhentheinformationwasscannedfromahandwrittenform.Regardlessofthecauseoftheinconsistentvalues,itisimportanttodetectand,ifpossible,correctsuchproblems.

Sometypesofinconsistencesareeasytodetect.Forinstance,aperson’sheightshouldnotbenegative.Inothercases,itcanbenecessarytoconsultanexternalsourceofinformation.Forexample,whenaninsurancecompanyprocessesclaimsforreimbursement,itchecksthenamesandaddressesonthereimbursementformsagainstadatabaseofitscustomers.

Once an inconsistency has been detected, it is sometimes possible to correct the data. A product code may have "check" digits, or it may be possible to double-check a product code against a list of known product codes, and then correct the code if it is incorrect, but close to a known code. The correction of an inconsistency requires additional or redundant information.
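A minimal sketch of the two kinds of checks just described: validating a (zip code, city) pair against a reference table, and repairing a mistyped product code by matching it against a list of known codes. The reference data and codes are invented for the example, and difflib is only one simple way to find a close match.

    import difflib

    # Hypothetical reference data.
    zip_to_city = {"55455": "Minneapolis", "48824": "East Lansing"}
    known_product_codes = ["AB-1234", "CD-9876", "EF-5555"]

    def check_address(zip_code, city):
        """Return True if the zip code and city are consistent with the reference table."""
        return zip_to_city.get(zip_code) == city

    def repair_product_code(code):
        """Replace an unknown code with the closest known code, if one is close enough."""
        if code in known_product_codes:
            return code
        matches = difflib.get_close_matches(code, known_product_codes, n=1, cutoff=0.8)
        return matches[0] if matches else code   # leave it alone if nothing is close

    print(check_address("55455", "Chicago"))     # False: inconsistent pair
    print(repair_product_code("AB-1235"))        # 'AB-1234' (closest known code)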

Example2.6(InconsistentSeaSurfaceTemperature).Thisexampleillustratesaninconsistencyinactualtimeseriesdatathatmeasurestheseasurfacetemperature(SST)atvariouspointsontheocean.SSTdatawasoriginallycollectedusingocean-basedmeasurementsfromshipsorbuoys,butmorerecently,satelliteshavebeenusedtogatherthedata.Tocreatealong-termdataset,bothsourcesofdatamustbeused.However,becausethedatacomesfromdifferentsources,thetwopartsofthedataaresubtlydifferent.ThisdiscrepancyisvisuallydisplayedinFigure2.7 ,whichshowsthecorrelationofSSTvaluesbetweenpairsofyears.Ifapairofyearshasapositivecorrelation,thenthelocationcorrespondingtothepairofyearsiscoloredwhite;otherwiseitiscoloredblack.(Seasonalvariationswereremovedfromthedatasince,otherwise,alltheyearswouldbehighlycorrelated.)Thereisadistinctchangeinbehaviorwherethedatahasbeenputtogetherin1983.Yearswithineachofthetwogroups,1958–1982and1983–1999,tendtohaveapositivecorrelationwithoneanother,butanegativecorrelationwithyearsintheothergroup.Thisdoesnotmeanthatthisdatashouldnotbeused,onlythattheanalystshouldconsiderthepotentialimpactofsuchdiscrepanciesonthedatamininganalysis.

Figure2.7.CorrelationofSSTdatabetweenpairsofyears.Whiteareasindicatepositivecorrelation.Blackareasindicatenegativecorrelation.

DuplicateDataAdatasetcanincludedataobjectsthatareduplicates,oralmostduplicates,ofoneanother.Manypeoplereceiveduplicatemailingsbecausetheyappearinadatabasemultipletimesunderslightlydifferentnames.Todetectandeliminatesuchduplicates,twomainissuesmustbeaddressed.First,iftherearetwoobjectsthatactuallyrepresentasingleobject,thenoneormorevaluesofcorrespondingattributesareusuallydifferent,andtheseinconsistentvaluesmustberesolved.Second,careneedstobetakentoavoidaccidentallycombiningdataobjectsthataresimilar,butnotduplicates,such

astwodistinctpeoplewithidenticalnames.Thetermdeduplicationisoftenusedtorefertotheprocessofdealingwiththeseissues.

Insomecases,twoormoreobjectsareidenticalwithrespecttotheattributesmeasuredbythedatabase,buttheystillrepresentdifferentobjects.Here,theduplicatesarelegitimate,butcanstillcauseproblemsforsomealgorithmsifthepossibilityofidenticalobjectsisnotspecificallyaccountedforintheirdesign.AnexampleofthisisgiveninExercise13 onpage108.

2.2.2IssuesRelatedtoApplications

Dataqualityissuescanalsobeconsideredfromanapplicationviewpointasexpressedbythestatement“dataisofhighqualityifitissuitableforitsintendeduse.”Thisapproachtodataqualityhasprovenquiteuseful,particularlyinbusinessandindustry.Asimilarviewpointisalsopresentinstatisticsandtheexperimentalsciences,withtheiremphasisonthecarefuldesignofexperimentstocollectthedatarelevanttoaspecifichypothesis.Aswithqualityissuesatthemeasurementanddatacollectionlevel,manyissuesarespecifictoparticularapplicationsandfields.Again,weconsideronlyafewofthegeneralissues.

Timeliness

Somedatastartstoageassoonasithasbeencollected.Inparticular,ifthedataprovidesasnapshotofsomeongoingphenomenonorprocess,suchasthepurchasingbehaviorofcustomersorwebbrowsingpatterns,thenthissnapshotrepresentsrealityforonlyalimitedtime.Ifthedataisoutofdate,thensoarethemodelsandpatternsthatarebasedonit.

Relevance

Theavailabledatamustcontaintheinformationnecessaryfortheapplication.Considerthetaskofbuildingamodelthatpredictstheaccidentratefordrivers.Ifinformationabouttheageandgenderofthedriverisomitted,thenitislikelythatthemodelwillhavelimitedaccuracyunlessthisinformationisindirectlyavailablethroughotherattributes.

Makingsurethattheobjectsinadatasetarerelevantisalsochallenging.Acommonproblemissamplingbias,whichoccurswhenasampledoesnotcontaindifferenttypesofobjectsinproportiontotheiractualoccurrenceinthepopulation.Forexample,surveydatadescribesonlythosewhorespondtothesurvey.(OtheraspectsofsamplingarediscussedfurtherinSection2.3.2 .)Becausetheresultsofadataanalysiscanreflectonlythedatathatispresent,samplingbiaswilltypicallyleadtoerroneousresultswhenappliedtothebroaderpopulation.

KnowledgeabouttheData

Ideally,datasetsareaccompaniedbydocumentationthatdescribesdifferentaspectsofthedata;thequalityofthisdocumentationcaneitheraidorhinderthesubsequentanalysis.Forexample,ifthedocumentationidentifiesseveralattributesasbeingstronglyrelated,theseattributesarelikelytoprovidehighlyredundantinformation,andweusuallydecidetokeepjustone.(Considersalestaxandpurchaseprice.)Ifthedocumentationispoor,however,andfailstotellus,forexample,thatthemissingvaluesforaparticularfieldareindicatedwitha-9999,thenouranalysisofthedatamaybefaulty.Otherimportantcharacteristicsaretheprecisionofthedata,thetypeoffeatures(nominal,ordinal,interval,ratio),thescaleofmeasurement(e.g.,metersorfeetforlength),andtheoriginofthedata.

2.3DataPreprocessingInthissection,weconsiderwhichpreprocessingstepsshouldbeappliedtomakethedatamoresuitablefordatamining.Datapreprocessingisabroadareaandconsistsofanumberofdifferentstrategiesandtechniquesthatareinterrelatedincomplexways.Wewillpresentsomeofthemostimportantideasandapproaches,andtrytopointouttheinterrelationshipsamongthem.Specifically,wewilldiscussthefollowingtopics:

Aggregation
Sampling
Dimensionality reduction
Feature subset selection
Feature creation
Discretization and binarization
Variable transformation

Roughlyspeaking,thesetopicsfallintotwocategories:selectingdataobjectsandattributesfortheanalysisorforcreating/changingtheattributes.Inbothcases,thegoalistoimprovethedatamininganalysiswithrespecttotime,cost,andquality.Detailsareprovidedinthefollowingsections.

Aquicknoteaboutterminology:Inthefollowing,wesometimesusesynonymsforattribute,suchasfeatureorvariable,inordertofollowcommonusage.

2.3.1Aggregation

Sometimes“lessismore,”andthisisthecasewithaggregation,thecombiningoftwoormoreobjectsintoasingleobject.Consideradatasetconsistingoftransactions(dataobjects)recordingthedailysalesofproductsinvariousstorelocations(Minneapolis,Chicago,Paris,…)fordifferentdaysoverthecourseofayear.SeeTable2.4 .Onewaytoaggregatetransactionsforthisdatasetistoreplaceallthetransactionsofasinglestorewithasinglestorewidetransaction.Thisreducesthehundredsorthousandsoftransactionsthatoccurdailyataspecificstoretoasingledailytransaction,andthenumberofdataobjectsperdayisreducedtothenumberofstores.

Table2.4.Datasetcontaininginformationaboutcustomerpurchases.

TransactionID Item StoreLocation Date Price …

⋮ ⋮ ⋮ ⋮ ⋮

101123 Watch Chicago 09/06/04 $25.99 …

101123 Battery Chicago 09/06/04 $5.99 …

101124 Shoes Minneapolis 09/06/04 $75.00 …

Anobviousissueishowanaggregatetransactioniscreated;i.e.,howthevaluesofeachattributearecombinedacrossalltherecordscorrespondingtoaparticularlocationtocreatetheaggregatetransactionthatrepresentsthesalesofasinglestoreordate.Quantitativeattributes,suchasprice,aretypicallyaggregatedbytakingasumoranaverage.Aqualitativeattribute,suchasitem,caneitherbeomittedorsummarizedintermsofahigherlevelcategory,e.g.,televisionsversuselectronics.
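As a quick illustration of this kind of aggregation, the following sketch (a minimal example using pandas, with hypothetical records patterned after Table 2.4) collapses individual transactions into one store-wide record per store and date, summing the quantitative Price attribute and simply omitting the qualitative Item attribute.

    import pandas as pd

    # Hypothetical transactions patterned after Table 2.4.
    transactions = pd.DataFrame({
        "TransactionID": [101123, 101123, 101124],
        "Item": ["Watch", "Battery", "Shoes"],
        "StoreLocation": ["Chicago", "Chicago", "Minneapolis"],
        "Date": ["09/06/04", "09/06/04", "09/06/04"],
        "Price": [25.99, 5.99, 75.00],
    })

    # Aggregate: one row per (store, date); the quantitative attribute is summed,
    # the qualitative Item attribute is simply dropped.
    store_daily = (transactions
                   .groupby(["StoreLocation", "Date"], as_index=False)["Price"]
                   .sum())
    print(store_daily)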

ThedatainTable2.4 canalsobeviewedasamultidimensionalarray,whereeachattributeisadimension.Fromthisviewpoint,aggregationistheprocessofeliminatingattributes,suchasthetypeofitem,orreducingthe

numberofvaluesforaparticularattribute;e.g.,reducingthepossiblevaluesfordatefrom365daysto12months.ThistypeofaggregationiscommonlyusedinOnlineAnalyticalProcessing(OLAP).ReferencestoOLAParegiveninthebibliographicNotes.

Thereareseveralmotivationsforaggregation.First,thesmallerdatasetsresultingfromdatareductionrequirelessmemoryandprocessingtime,andhence,aggregationoftenenablestheuseofmoreexpensivedataminingalgorithms.Second,aggregationcanactasachangeofscopeorscalebyprovidingahigh-levelviewofthedatainsteadofalow-levelview.Inthepreviousexample,aggregatingoverstorelocationsandmonthsgivesusamonthly,perstoreviewofthedatainsteadofadaily,peritemview.Finally,thebehaviorofgroupsofobjectsorattributesisoftenmorestablethanthatofindividualobjectsorattributes.Thisstatementreflectsthestatisticalfactthataggregatequantities,suchasaveragesortotals,havelessvariabilitythantheindividualvaluesbeingaggregated.Fortotals,theactualamountofvariationislargerthanthatofindividualobjects(onaverage),butthepercentageofthevariationissmaller,whileformeans,theactualamountofvariationislessthanthatofindividualobjects(onaverage).Adisadvantageofaggregationisthepotentiallossofinterestingdetails.Inthestoreexample,aggregatingovermonthslosesinformationaboutwhichdayoftheweekhasthehighestsales.

Example 2.7 (Australian Precipitation). This example is based on precipitation in Australia from the period 1982–1993. Figure 2.8(a) shows a histogram for the standard deviation of average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia, while Figure 2.8(b) shows a histogram for the standard deviation of the average yearly precipitation for the same locations. The average yearly precipitation has less variability than the average monthly precipitation. All precipitation measurements (and their standard deviations) are in centimeters.

Figure2.8.HistogramsofstandarddeviationformonthlyandyearlyprecipitationinAustraliafortheperiod1982–1993.

2.3.2Sampling

Samplingisacommonlyusedapproachforselectingasubsetofthedataobjectstobeanalyzed.Instatistics,ithaslongbeenusedforboththepreliminaryinvestigationofthedataandthefinaldataanalysis.Samplingcanalsobeveryusefulindatamining.However,themotivationsforsamplinginstatisticsanddataminingareoftendifferent.Statisticiansusesamplingbecauseobtainingtheentiresetofdataofinterestistooexpensiveortimeconsuming,whiledataminersusuallysamplebecauseitistoocomputationallyexpensiveintermsofthememoryortimerequiredtoprocess

allthedata.Insomecases,usingasamplingalgorithmcanreducethedatasizetothepointwhereabetter,butmorecomputationallyexpensivealgorithmcanbeused.

Thekeyprincipleforeffectivesamplingisthefollowing:Usingasamplewillworkalmostaswellasusingtheentiredatasetifthesampleisrepresentative.Inturn,asampleisrepresentativeifithasapproximatelythesameproperty(ofinterest)astheoriginalsetofdata.Ifthemean(average)ofthedataobjectsisthepropertyofinterest,thenasampleisrepresentativeifithasameanthatisclosetothatoftheoriginaldata.Becausesamplingisastatisticalprocess,therepresentativenessofanyparticularsamplewillvary,andthebestthatwecandoischooseasamplingschemethatguaranteesahighprobabilityofgettingarepresentativesample.Asdiscussednext,thisinvolveschoosingtheappropriatesamplesizeandsamplingtechnique.

SamplingApproachesTherearemanysamplingtechniques,butonlyafewofthemostbasiconesandtheirvariationswillbecoveredhere.Thesimplesttypeofsamplingissimplerandomsampling.Forthistypeofsampling,thereisanequalprobabilityofselectinganyparticularobject.Therearetwovariationsonrandomsampling(andothersamplingtechniquesaswell):(1)samplingwithoutreplacement—aseachobjectisselected,itisremovedfromthesetofallobjectsthattogetherconstitutethepopulation,and(2)samplingwithreplacement—objectsarenotremovedfromthepopulationastheyareselectedforthesample.Insamplingwithreplacement,thesameobjectcanbepickedmorethanonce.Thesamplesproducedbythetwomethodsarenotmuchdifferentwhensamplesarerelativelysmallcomparedtothedatasetsize,butsamplingwithreplacementissimplertoanalyzebecausetheprobabilityofselectinganyobjectremainsconstantduringthesamplingprocess.

Whenthepopulationconsistsofdifferenttypesofobjects,withwidelydifferentnumbersofobjects,simplerandomsamplingcanfailtoadequatelyrepresentthosetypesofobjectsthatarelessfrequent.Thiscancauseproblemswhentheanalysisrequiresproperrepresentationofallobjecttypes.Forexample,whenbuildingclassificationmodelsforrareclasses,itiscriticalthattherareclassesbeadequatelyrepresentedinthesample.Hence,asamplingschemethatcanaccommodatedifferingfrequenciesfortheobjecttypesofinterestisneeded.Stratifiedsampling,whichstartswithprespecifiedgroupsofobjects,issuchanapproach.Inthesimplestversion,equalnumbersofobjectsaredrawnfromeachgroupeventhoughthegroupsareofdifferentsizes.Inanothervariation,thenumberofobjectsdrawnfromeachgroupisproportionaltothesizeofthatgroup.
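The sketch below contrasts simple random sampling, with and without replacement, with a simple stratified scheme on a hypothetical population in which a rare group makes up only 5% of the objects; the group labels and sizes are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    population = np.arange(1000)                              # object indices
    labels = np.where(population < 950, "common", "rare")     # rare group: 5%

    # Simple random sampling, without and with replacement.
    sample_wo = rng.choice(population, size=50, replace=False)
    sample_w = rng.choice(population, size=50, replace=True)

    # Stratified sampling: draw an equal number of objects from each group.
    sample_strat = np.concatenate([
        rng.choice(population[labels == g], size=25, replace=False)
        for g in ["common", "rare"]
    ])

    # The stratified sample is guaranteed to represent the rare group.
    print((labels[sample_wo] == "rare").sum(), (labels[sample_strat] == "rare").sum())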

Example2.8(SamplingandLossofInformation).Onceasamplingtechniquehasbeenselected,itisstillnecessarytochoosethesamplesize.Largersamplesizesincreasetheprobabilitythatasamplewillberepresentative,buttheyalsoeliminatemuchoftheadvantageofsampling.Conversely,withsmallersamplesizes,patternscanbemissedorerroneouspatternscanbedetected.Figure2.9(a)showsadatasetthatcontains8000two-dimensionalpoints,whileFigures2.9(b) and2.9(c) showsamplesfromthisdatasetofsize2000and500,respectively.Althoughmostofthestructureofthisdatasetispresentinthesampleof2000points,muchofthestructureismissinginthesampleof500points.

Figure2.9.Exampleofthelossofstructurewithsampling.

Example2.9(DeterminingtheProperSampleSize).Toillustratethatdeterminingthepropersamplesizerequiresamethodicalapproach,considerthefollowingtask.

Given a set of data consisting of a small number of almost equal-sized groups, find at least one representative point for each of the groups. Assume that the objects in each group are highly similar to each other, but not very similar to objects in different groups. Figure 2.10(a) shows an idealized set of clusters (groups) from which these points might be drawn.

Figure2.10.Findingrepresentativepointsfrom10groups.

Thisproblemcanbeefficientlysolvedusingsampling.Oneapproachistotakeasmallsampleofdatapoints,computethepairwisesimilaritiesbetweenpoints,andthenformgroupsofpointsthatarehighlysimilar.Thedesiredsetofrepresentativepointsisthenobtainedbytakingonepointfromeachofthesegroups.Tofollowthisapproach,however,weneedtodetermineasamplesizethatwouldguarantee,withahighprobability,thedesiredoutcome;thatis,thatatleastonepointwillbeobtainedfromeachcluster.Figure2.10(b) showstheprobabilityofgettingoneobjectfromeachofthe10groupsasthesamplesizerunsfrom10to60.Interestingly,withasamplesizeof20,thereislittlechance(20%)ofgettingasamplethatincludesall10clusters.Evenwithasamplesizeof30,thereisstillamoderatechance(almost40%)ofgettingasamplethatdoesn’tcontainobjectsfromall10clusters.ThisissueisfurtherexploredinthecontextofclusteringbyExercise4 onpage603.
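Assuming points are drawn independently and the 10 groups are equally likely, the probability that a sample of a given size contains at least one point from every group can be computed by inclusion-exclusion; the small sketch below reproduces numbers of roughly the magnitude quoted above (about 0.2 at a sample size of 20).

    from math import comb

    def prob_all_groups(sample_size, n_groups=10):
        """P(sample contains at least one point from every group), assuming
        points are drawn independently and each group is equally likely."""
        return sum((-1) ** j * comb(n_groups, j) * ((n_groups - j) / n_groups) ** sample_size
                   for j in range(n_groups + 1))

    for s in (10, 20, 30, 40, 50, 60):
        print(s, round(prob_all_groups(s), 3))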

ProgressiveSamplingThepropersamplesizecanbedifficulttodetermine,soadaptiveorprogressivesamplingschemesaresometimesused.Theseapproachesstartwithasmallsample,andthenincreasethesamplesizeuntilasampleofsufficientsizehasbeenobtained.Whilethistechniqueeliminatestheneedtodeterminethecorrectsamplesizeinitially,itrequiresthattherebeawaytoevaluatethesampletojudgeifitislargeenough.

Suppose,forinstance,thatprogressivesamplingisusedtolearnapredictivemodel.Althoughtheaccuracyofpredictivemodelsincreasesasthesamplesizeincreases,atsomepointtheincreaseinaccuracylevelsoff.Wewanttostopincreasingthesamplesizeatthisleveling-offpoint.Bykeepingtrackofthechangeinaccuracyofthemodelaswetakeprogressivelylargersamples,andbytakingothersamplesclosetothesizeofthecurrentone,wecangetanestimateofhowclosewearetothisleveling-offpoint,andthus,stopsampling.

2.3.3DimensionalityReduction

Datasetscanhavealargenumberoffeatures.Considerasetofdocuments,whereeachdocumentisrepresentedbyavectorwhosecomponentsarethefrequencieswithwhicheachwordoccursinthedocument.Insuchcases,therearetypicallythousandsortensofthousandsofattributes(components),oneforeachwordinthevocabulary.Asanotherexample,considerasetoftimeseriesconsistingofthedailyclosingpriceofvariousstocksoveraperiodof30years.Inthiscase,theattributes,whicharethepricesonspecificdays,againnumberinthethousands.

Thereareavarietyofbenefitstodimensionalityreduction.Akeybenefitisthatmanydataminingalgorithmsworkbetterifthedimensionality—thenumberofattributesinthedata—islower.Thisispartlybecausedimensionalityreductioncaneliminateirrelevantfeaturesandreducenoiseandpartlybecauseofthecurseofdimensionality,whichisexplainedbelow.Anotherbenefitisthatareductionofdimensionalitycanleadtoamoreunderstandablemodelbecausethemodelusuallyinvolvesfewerattributes.Also,dimensionalityreductionmayallowthedatatobemoreeasilyvisualized.Evenifdimensionalityreductiondoesn’treducethedatatotwoorthreedimensions,dataisoftenvisualizedbylookingatpairsortripletsofattributes,andthenumberofsuchcombinationsisgreatlyreduced.Finally,theamountoftimeandmemoryrequiredbythedataminingalgorithmisreducedwithareductionindimensionality.

Thetermdimensionalityreductionisoftenreservedforthosetechniquesthatreducethedimensionalityofadatasetbycreatingnewattributesthatareacombinationoftheoldattributes.Thereductionofdimensionalitybyselectingattributesthatareasubsetoftheoldisknownasfeaturesubsetselectionorfeatureselection.ItwillbediscussedinSection2.3.4 .

Intheremainderofthissection,webrieflyintroducetwoimportanttopics:thecurseofdimensionalityanddimensionalityreductiontechniquesbasedonlinearalgebraapproachessuchasprincipalcomponentsanalysis(PCA).MoredetailsondimensionalityreductioncanbefoundinAppendixB.

TheCurseofDimensionalityThecurseofdimensionalityreferstothephenomenonthatmanytypesofdataanalysisbecomesignificantlyharderasthedimensionalityofthedataincreases.Specifically,asdimensionalityincreases,thedatabecomesincreasinglysparseinthespacethatitoccupies.Thus,thedataobjectswe

observearequitepossiblynotarepresentativesampleofallpossibleobjects.Forclassification,thiscanmeanthattherearenotenoughdataobjectstoallowthecreationofamodelthatreliablyassignsaclasstoallpossibleobjects.Forclustering,thedifferencesindensityandinthedistancesbetweenpoints,whicharecriticalforclustering,becomelessmeaningful.(ThisisdiscussedfurtherinSections8.1.2,8.4.6,and8.4.8.)Asaresult,manyclusteringandclassificationalgorithms(andotherdataanalysisalgorithms)havetroublewithhigh-dimensionaldataleadingtoreducedclassificationaccuracyandpoorqualityclusters.

LinearAlgebraTechniquesforDimensionalityReductionSomeofthemostcommonapproachesfordimensionalityreduction,particularlyforcontinuousdata,usetechniquesfromlinearalgebratoprojectthedatafromahigh-dimensionalspaceintoalower-dimensionalspace.PrincipalComponentsAnalysis(PCA)isalinearalgebratechniqueforcontinuousattributesthatfindsnewattributes(principalcomponents)that(1)arelinearcombinationsoftheoriginalattributes,(2)areorthogonal(perpendicular)toeachother,and(3)capturethemaximumamountofvariationinthedata.Forexample,thefirsttwoprincipalcomponentscaptureasmuchofthevariationinthedataasispossiblewithtwoorthogonalattributesthatarelinearcombinationsoftheoriginalattributes.SingularValueDecomposition(SVD)isalinearalgebratechniquethatisrelatedtoPCAandisalsocommonlyusedfordimensionalityreduction.Foradditionaldetails,seeAppendicesAandB.
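As a minimal sketch of the linear algebra involved (one standard way to compute PCA, not the only one), the following code mean-centers a hypothetical data matrix, obtains the principal components from its singular value decomposition, and projects the data onto the first two components.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical data: 200 points in 10 dimensions whose variation is
    # concentrated in two latent directions plus a little noise.
    latent = rng.normal(size=(200, 2))
    mixing = rng.normal(size=(2, 10))
    X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

    # PCA via the SVD of the mean-centered data matrix.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)       # fraction of variance per component
    X_reduced = Xc @ Vt[:2].T             # projection onto the first two principal components
    print(explained[:3], X_reduced.shape)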

2.3.4FeatureSubsetSelection

Anotherwaytoreducethedimensionalityistouseonlyasubsetofthefeatures.Whileitmightseemthatsuchanapproachwouldloseinformation,thisisnotthecaseifredundantandirrelevantfeaturesarepresent.Redundantfeaturesduplicatemuchoralloftheinformationcontainedinoneormoreotherattributes.Forexample,thepurchasepriceofaproductandtheamountofsalestaxpaidcontainmuchofthesameinformation.Irrelevantfeaturescontainalmostnousefulinformationforthedataminingtaskathand.Forinstance,students’IDnumbersareirrelevanttothetaskofpredictingstudents’gradepointaverages.Redundantandirrelevantfeaturescanreduceclassificationaccuracyandthequalityoftheclustersthatarefound.

While some irrelevant and redundant attributes can be eliminated immediately by using common sense or domain knowledge, selecting the best subset of features frequently requires a systematic approach. The ideal approach to feature selection is to try all possible subsets of features as input to the data mining algorithm of interest, and then take the subset that produces the best results. This method has the advantage of reflecting the objective and bias of the data mining algorithm that will eventually be used. Unfortunately, since the number of subsets involving n attributes is 2^n, such an approach is impractical in most situations and alternative strategies are needed. There are three standard approaches to feature selection: embedded, filter, and wrapper.

Embedded approaches

Feature selection occurs naturally as part of the data mining algorithm. Specifically, during the operation of the data mining algorithm, the algorithm itself decides which attributes to use and which to ignore. Algorithms for building decision tree classifiers, which are discussed in Chapter 3, often operate in this manner.

Filterapproaches

Featuresareselectedbeforethedataminingalgorithmisrun,usingsomeapproachthatisindependentofthedataminingtask.Forexample,wemightselectsetsofattributeswhosepairwisecorrelationisaslowaspossiblesothattheattributesarenon-redundant.

Wrapperapproaches

Thesemethodsusethetargetdataminingalgorithmasablackboxtofindthebestsubsetofattributes,inawaysimilartothatoftheidealalgorithmdescribedabove,buttypicallywithoutenumeratingallpossiblesubsets.

Becausetheembeddedapproachesarealgorithm-specific,onlythefilterandwrapperapproacheswillbediscussedfurtherhere.

AnArchitectureforFeatureSubsetSelectionItispossibletoencompassboththefilterandwrapperapproacheswithinacommonarchitecture.Thefeatureselectionprocessisviewedasconsistingoffourparts:ameasureforevaluatingasubset,asearchstrategythatcontrolsthegenerationofanewsubsetoffeatures,astoppingcriterion,andavalidationprocedure.Filtermethodsandwrappermethodsdifferonlyinthewayinwhichtheyevaluateasubsetoffeatures.Forawrappermethod,subsetevaluationusesthetargetdataminingalgorithm,whileforafilterapproach,theevaluationtechniqueisdistinctfromthetargetdataminingalgorithm.Thefollowingdiscussionprovidessomedetailsofthisapproach,whichissummarizedinFigure2.11 .

Figure2.11.Flowchartofafeaturesubsetselectionprocess.

Conceptually,featuresubsetselectionisasearchoverallpossiblesubsetsoffeatures.Manydifferenttypesofsearchstrategiescanbeused,butthesearchstrategyshouldbecomputationallyinexpensiveandshouldfindoptimalornearoptimalsetsoffeatures.Itisusuallynotpossibletosatisfybothrequirements,andthus,trade-offsarenecessary.

Anintegralpartofthesearchisanevaluationsteptojudgehowthecurrentsubsetoffeaturescomparestoothersthathavebeenconsidered.Thisrequiresanevaluationmeasurethatattemptstodeterminethegoodnessofasubsetofattributeswithrespecttoaparticulardataminingtask,suchasclassificationorclustering.Forthefilterapproach,suchmeasuresattempttopredicthowwelltheactualdataminingalgorithmwillperformonagivensetofattributes.Forthewrapperapproach,whereevaluationconsistsofactuallyrunningthetargetdataminingalgorithm,thesubsetevaluationfunctionissimplythecriterionnormallyusedtomeasuretheresultofthedatamining.

Becausethenumberofsubsetscanbeenormousanditisimpracticaltoexaminethemall,somesortofstoppingcriterionisnecessary.Thisstrategyisusuallybasedononeormoreconditionsinvolvingthefollowing:thenumberofiterations,whetherthevalueofthesubsetevaluationmeasureisoptimalorexceedsacertainthreshold,whetherasubsetofacertainsizehasbeenobtained,andwhetheranyimprovementcanbeachievedbytheoptionsavailabletothesearchstrategy.

Finally,onceasubsetoffeatureshasbeenselected,theresultsofthetargetdataminingalgorithmontheselectedsubsetshouldbevalidated.Astraightforwardvalidationapproachistorunthealgorithmwiththefullsetoffeaturesandcomparethefullresultstoresultsobtainedusingthesubsetoffeatures.Hopefully,thesubsetoffeatureswillproduceresultsthatarebetterthanoralmostasgoodasthoseproducedwhenusingallfeatures.Anothervalidationapproachistouseanumberofdifferentfeatureselectionalgorithmstoobtainsubsetsoffeaturesandthencomparetheresultsofrunningthedataminingalgorithmoneachsubset.
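A minimal sketch of a wrapper-style search is given below: greedy forward selection that treats an evaluation function as a black box standing in for the target data mining algorithm. The evaluation function and feature names here are invented toys; a real wrapper would score each subset by, for example, cross-validated classification accuracy.

    def forward_select(features, evaluate, max_features):
        """Greedy wrapper-style forward selection.
        `evaluate(subset)` is the black-box score of the target data mining
        algorithm on that subset (higher is better)."""
        selected = []
        best_score = float("-inf")
        while len(selected) < max_features:
            candidates = [f for f in features if f not in selected]
            score, best_f = max((evaluate(selected + [f]), f) for f in candidates)
            if score <= best_score:          # stopping criterion: no improvement
                break
            selected.append(best_f)
            best_score = score
        return selected, best_score

    # Toy evaluation: pretend only features "a" and "c" carry signal.
    toy_eval = lambda subset: len({"a", "c"} & set(subset)) - 0.01 * len(subset)
    print(forward_select(["a", "b", "c", "d"], toy_eval, max_features=3))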

FeatureWeightingFeatureweightingisanalternativetokeepingoreliminatingfeatures.Moreimportantfeaturesareassignedahigherweight,whilelessimportantfeaturesaregivenalowerweight.Theseweightsaresometimesassignedbasedondomainknowledgeabouttherelativeimportanceoffeatures.Alternatively,theycansometimesbedeterminedautomatically.Forexample,someclassificationschemes,suchassupportvectormachines(Chapter4 ),produceclassificationmodelsinwhicheachfeatureisgivenaweight.Featureswithlargerweightsplayamoreimportantroleinthemodel.Thenormalizationofobjectsthattakesplacewhencomputingthecosinesimilarity(Section2.4.5 )canalsoberegardedasatypeoffeatureweighting.

2.3.5FeatureCreation

Itisfrequentlypossibletocreate,fromtheoriginalattributes,anewsetofattributesthatcapturestheimportantinformationinadatasetmuchmoreeffectively.Furthermore,thenumberofnewattributescanbesmallerthanthenumberoforiginalattributes,allowingustoreapallthepreviouslydescribedbenefitsofdimensionalityreduction.Tworelatedmethodologiesforcreatingnewattributesaredescribednext:featureextractionandmappingthedatatoanewspace.

FeatureExtractionThecreationofanewsetoffeaturesfromtheoriginalrawdataisknownasfeatureextraction.Considerasetofphotographs,whereeachphotographistobeclassifiedaccordingtowhetheritcontainsahumanface.Therawdataisasetofpixels,andassuch,isnotsuitableformanytypesofclassificationalgorithms.However,ifthedataisprocessedtoprovidehigher-levelfeatures,suchasthepresenceorabsenceofcertaintypesofedgesandareasthatarehighlycorrelatedwiththepresenceofhumanfaces,thenamuchbroadersetofclassificationtechniquescanbeappliedtothisproblem.

Unfortunately,inthesenseinwhichitismostcommonlyused,featureextractionishighlydomain-specific.Foraparticularfield,suchasimageprocessing,variousfeaturesandthetechniquestoextractthemhavebeendevelopedoveraperiodoftime,andoftenthesetechniqueshavelimitedapplicabilitytootherfields.Consequently,wheneverdataminingisappliedtoarelativelynewarea,akeytaskisthedevelopmentofnewfeaturesandfeatureextractionmethods.

Althoughfeatureextractionisoftencomplicated,Example2.10 illustratesthatitcanberelativelystraightforward.

Example2.10(Density).Consideradatasetconsistingofinformationabouthistoricalartifacts,which,alongwithotherinformation,containsthevolumeandmassofeachartifact.Forsimplicity,assumethattheseartifactsaremadeofasmallnumberofmaterials(wood,clay,bronze,gold)andthatwewanttoclassifytheartifactswithrespecttothematerialofwhichtheyaremade.Inthiscase,adensityfeatureconstructedfromthemassandvolumefeatures,i.e.,density=mass/volume,wouldmostdirectlyyieldanaccurateclassification.Althoughtherehavebeensomeattemptstoautomaticallyperformsuchsimplefeatureextractionbyexploringbasicmathematicalcombinationsofexistingattributes,themostcommonapproachistoconstructfeaturesusingdomainexpertise.

MappingtheDatatoaNewSpaceAtotallydifferentviewofthedatacanrevealimportantandinterestingfeatures.Consider,forexample,timeseriesdata,whichoftencontainsperiodicpatterns.Ifthereisonlyasingleperiodicpatternandnotmuchnoise,thenthepatterniseasilydetected.If,ontheotherhand,thereareanumberofperiodicpatternsandasignificantamountofnoise,thenthesepatternsarehardtodetect.Suchpatternscan,nonetheless,oftenbedetectedbyapplyingaFouriertransformtothetimeseriesinordertochangetoarepresentationinwhichfrequencyinformationisexplicit.InExample2.11 ,itwillnotbenecessarytoknowthedetailsoftheFouriertransform.Itisenoughtoknowthat,foreachtimeseries,theFouriertransformproducesanewdataobjectwhoseattributesarerelatedtofrequencies.

Example2.11(FourierAnalysis).ThetimeseriespresentedinFigure2.12(b) isthesumofthreeothertimeseries,twoofwhichareshowninFigure2.12(a) andhavefrequenciesof7and17cyclespersecond,respectively.Thethirdtimeseriesisrandomnoise.Figure2.12(c) showsthepowerspectrumthatcanbecomputedafterapplyingaFouriertransformtotheoriginaltimeseries.(Informally,thepowerspectrumisproportionaltothesquareofeachfrequencyattribute.)Inspiteofthenoise,therearetwopeaksthatcorrespondtotheperiodsofthetwooriginal,non-noisytimeseries.Again,themainpointisthatbetterfeaturescanrevealimportantaspectsofthedata.

Figure2.12.ApplicationoftheFouriertransformtoidentifytheunderlyingfrequenciesintimeseriesdata.

Manyothersortsoftransformationsarealsopossible.BesidestheFouriertransform,thewavelettransformhasalsoprovenveryusefulfortimeseriesandothertypesofdata.
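The sketch below, loosely following Example 2.11 (the sampling rate and noise level are assumptions), builds a noisy sum of 7 Hz and 17 Hz sinusoids and recovers the two frequencies from the power spectrum computed with NumPy's FFT.

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.arange(0, 2, 0.01)                     # 2 seconds sampled at 100 Hz
    signal = np.sin(2 * np.pi * 7 * t) + np.sin(2 * np.pi * 17 * t)
    signal += 0.5 * rng.normal(size=t.size)       # random noise

    # Map the data to a new space: frequency attributes via the Fourier transform.
    spectrum = np.abs(np.fft.rfft(signal)) ** 2   # power spectrum
    freqs = np.fft.rfftfreq(t.size, d=0.01)

    # The two largest peaks (ignoring the constant component) recover the frequencies.
    top = freqs[1:][np.argsort(spectrum[1:])[-2:]]
    print(sorted(top))                            # expected to be close to [7.0, 17.0]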

2.3.6DiscretizationandBinarization

Somedataminingalgorithms,especiallycertainclassificationalgorithms,requirethatthedatabeintheformofcategoricalattributes.Algorithmsthatfindassociationpatternsrequirethatthedatabeintheformofbinaryattributes.Thus,itisoftennecessarytotransformacontinuousattributeintoacategoricalattribute(discretization),andbothcontinuousanddiscreteattributesmayneedtobetransformedintooneormorebinaryattributes(binarization).Additionally,ifacategoricalattributehasalargenumberofvalues(categories),orsomevaluesoccurinfrequently,thenitcanbebeneficialforcertaindataminingtaskstoreducethenumberofcategoriesbycombiningsomeofthevalues.

Aswithfeatureselection,thebestdiscretizationorbinarizationapproachistheonethat“producesthebestresultforthedataminingalgorithmthatwillbeusedtoanalyzethedata.”Itistypicallynotpracticaltoapplysuchacriteriondirectly.Consequently,discretizationorbinarizationisperformedinawaythatsatisfiesacriterionthatisthoughttohavearelationshiptogoodperformanceforthedataminingtaskbeingconsidered.Ingeneral,thebestdiscretizationdependsonthealgorithmbeingused,aswellastheotherattributesbeingconsidered.Typically,however,thediscretizationofeachattributeisconsideredinisolation.

Binarization

A simple technique to binarize a categorical attribute is the following: If there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m − 1]. If the attribute is ordinal, then order must be maintained by the assignment. (Note that even if the attribute is originally represented using integers, this process is necessary if the integers are not in the interval [0, m − 1].) Next, convert each of these m integers to a binary number. Since n = ⌈log2(m)⌉ binary digits are required to represent these integers, represent these binary numbers using n binary attributes. To illustrate, a categorical variable with 5 values {awful, poor, OK, good, great} would require three binary variables x1, x2, and x3. The conversion is shown in Table 2.5.

Table 2.5. Conversion of a categorical attribute to three binary attributes.

Categorical Value  Integer Value  x1 x2 x3
awful              0              0  0  0
poor               1              0  0  1
OK                 2              0  1  0
good               3              0  1  1
great              4              1  0  0

Such a transformation can cause complications, such as creating unintended relationships among the transformed attributes. For example, in Table 2.5, attributes x2 and x3 are correlated because information about the good value is encoded using both attributes. Furthermore, association analysis requires asymmetric binary attributes, where only the presence of the attribute (value = 1) is important. For association problems, it is therefore necessary to introduce one asymmetric binary attribute for each categorical value, as shown in Table 2.6. If the number of resulting attributes is too large, then the techniques described in the following sections can be used to reduce the number of categorical values before binarization.

Table 2.6. Conversion of a categorical attribute to five asymmetric binary attributes.

Categorical Value  Integer Value  x1 x2 x3 x4 x5
awful              0              1  0  0  0  0
poor               1              0  1  0  0  0
OK                 2              0  0  1  0  0
good               3              0  0  0  1  0
great              4              0  0  0  0  1

Likewise,forassociationproblems,itcanbenecessarytoreplaceasinglebinaryattributewithtwoasymmetricbinaryattributes.Considerabinaryattributethatrecordsaperson’sgender,maleorfemale.Fortraditionalassociationrulealgorithms,thisinformationneedstobetransformedintotwoasymmetricbinaryattributes,onethatisa1onlywhenthepersonismaleandonethatisa1onlywhenthepersonisfemale.(Forasymmetricbinaryattributes,theinformationrepresentationissomewhatinefficientinthattwobitsofstoragearerequiredtorepresenteachbitofinformation.)
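A minimal sketch of both conversions using pandas is given below: the ordinal integer coding behind Table 2.5 and the one-attribute-per-value (one-hot) coding of Table 2.6.

    import pandas as pd

    quality = pd.Series(["awful", "poor", "OK", "good", "great"], name="quality")
    order = ["awful", "poor", "OK", "good", "great"]

    # Ordinal integer coding (Table 2.5 style): 0..m-1, order preserved.
    codes = quality.map({v: i for i, v in enumerate(order)})

    # One asymmetric binary attribute per value (Table 2.6 style).
    onehot = pd.get_dummies(quality)

    print(pd.concat([quality, codes.rename("integer")], axis=1))
    print(onehot[order].astype(int))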

Discretization of Continuous Attributes

Discretization is typically applied to attributes that are used in classification or association analysis. Transformation of a continuous attribute to a categorical attribute involves two subtasks: deciding how many categories, n, to have and determining how to map the values of the continuous attribute to these categories. In the first step, after the values of the continuous attribute are sorted, they are then divided into n intervals by specifying n − 1 split points. In the second, rather trivial, step, all the values in one interval are mapped to the same categorical value. Therefore, the problem of discretization is one of deciding how many split points to choose and where to place them. The result can be represented either as a set of intervals {(x_0, x_1], (x_1, x_2], ..., (x_{n−1}, x_n)}, where x_0 and x_n can be −∞ or +∞, respectively, or, equivalently, as a series of inequalities x_0 < x ≤ x_1, ..., x_{n−1} < x < x_n.

Unsupervised Discretization

A basic distinction between discretization methods for classification is whether class information is used (supervised) or not (unsupervised). If class information is not used, then relatively simple approaches are common. For instance, the equal width approach divides the range of the attribute into a user-specified number of intervals each having the same width. Such an approach can be badly affected by outliers, and for that reason, an equal frequency (equal depth) approach, which tries to put the same number of objects into each interval, is often preferred. As another example of unsupervised discretization, a clustering method, such as K-means (see Chapter 7), can also be used. Finally, visually inspecting the data can sometimes be an effective approach.

Example 2.12 (Discretization Techniques). This example demonstrates how these approaches work on an actual data set. Figure 2.13(a) shows data points belonging to four different groups, along with two outliers, the large dots on either end. The techniques of the previous paragraph were applied to discretize the x values of these data points into four categorical values. (Points in the data set have a random y component to make it easy to see how many points are in each group.) Visually inspecting the data works quite well, but is not automatic, and thus, we focus on the other three approaches. The split points produced by the techniques equal width, equal frequency, and K-means are shown in Figures 2.13(b), 2.13(c), and 2.13(d), respectively. The split points are represented as dashed lines.

Figure2.13.Differentdiscretizationtechniques.

Inthisparticularexample,ifwemeasuretheperformanceofadiscretizationtechniquebytheextenttowhichdifferentobjectsthatclumptogetherhavethesamecategoricalvalue,thenK-meansperformsbest,followedbyequalfrequency,andfinally,equalwidth.Moregenerally,thebestdiscretizationwilldependontheapplicationandofteninvolvesdomain-specificdiscretization.Forexample,thediscretizationofpeopleintolowincome,middleincome,andhighincomeisbasedoneconomicfactors.
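The sketch below applies the three unsupervised techniques to a hypothetical one-dimensional data set with four clumps and two outliers (the data and parameters are invented): equal width uses evenly spaced edges, equal frequency uses quantiles, and the K-means variant (via scikit-learn, assumed to be available) places split points midway between sorted cluster centers.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Four clumps of 50 points each, plus two outliers, as in Example 2.12.
    x = np.concatenate([rng.normal(loc, 0.3, 50) for loc in (2, 5, 8, 11)] + [[-5.0, 18.0]])

    k = 4
    # Equal width: split the range into k intervals of the same width.
    width_edges = np.linspace(x.min(), x.max(), k + 1)

    # Equal frequency: each interval gets (roughly) the same number of points.
    freq_edges = np.quantile(x, np.linspace(0, 1, k + 1))

    # K-means on the 1-D values: split points midway between sorted cluster centers.
    centers = np.sort(KMeans(n_clusters=k, n_init=10, random_state=0)
                      .fit(x.reshape(-1, 1)).cluster_centers_.ravel())
    kmeans_splits = (centers[:-1] + centers[1:]) / 2

    print(width_edges, freq_edges, kmeans_splits, sep="\n")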

SupervisedDiscretization

Ifclassificationisourapplicationandclasslabelsareknownforsomedataobjects,thendiscretizationapproachesthatuseclasslabelsoftenproducebetterclassification.Thisshouldnotbesurprising,sinceanintervalconstructedwithnoknowledgeofclasslabelsoftencontainsamixtureofclasslabels.Aconceptuallysimpleapproachistoplacethesplitsinawaythatmaximizesthepurityoftheintervals,i.e.,theextenttowhichanintervalcontainsasingleclasslabel.Inpractice,however,suchanapproachrequirespotentiallyarbitrarydecisionsaboutthepurityofanintervalandtheminimumsizeofaninterval.

Toovercomesuchconcerns,somestatisticallybasedapproachesstartwitheachattributevalueinaseparateintervalandcreatelargerintervalsbymergingadjacentintervalsthataresimilaraccordingtoastatisticaltest.Analternativetothisbottom-upapproachisatop-downapproachthatstartsbybisectingtheinitialvaluessothattheresultingtwointervalsgiveminimumentropy.Thistechniqueonlyneedstoconsidereachvalueasapossiblesplitpoint,becauseitisassumedthatintervalscontainorderedsetsofvalues.Thesplittingprocessisthenrepeatedwithanotherinterval,typicallychoosingtheintervalwiththeworst(highest)entropy,untilauser-specifiednumberofintervalsisreached,orastoppingcriterionissatisfied.

Entropy-based approaches are one of the most promising approaches to discretization, whether bottom-up or top-down. First, it is necessary to define entropy. Let k be the number of different class labels, m_i be the number of values in the i-th interval of a partition, and m_ij be the number of values of class j in interval i. Then the entropy e_i of the i-th interval is given by the equation

e_i = −∑_{j=1}^{k} p_ij log2 p_ij,

where p_ij = m_ij / m_i is the probability (fraction of values) of class j in the interval. The total entropy, e, of the partition is the weighted average of the individual interval entropies, i.e.,

e = ∑_{i=1}^{n} w_i e_i,

where m is the number of values, w_i = m_i / m is the fraction of values in the i-th interval, and n is the number of intervals. Intuitively, the entropy of an interval is a measure of the purity of an interval. If an interval contains only values of one class (is perfectly pure), then the entropy is 0 and it contributes nothing to the overall entropy. If the classes of values in an interval occur equally often (the interval is as impure as possible), then the entropy is a maximum.

Example 2.13 (Discretization of Two Attributes). The top-down method based on entropy was used to independently discretize both the x and y attributes of the two-dimensional data shown in Figure 2.14. In the first discretization, shown in Figure 2.14(a), the x and y attributes were both split into three intervals. (The dashed lines indicate the split points.) In the second discretization, shown in Figure 2.14(b), the x and y attributes were both split into five intervals.

Figure2.14.Discretizingxandyattributesforfourgroups(classes)ofpoints.

Thissimpleexampleillustratestwoaspectsofdiscretization.First,intwodimensions,theclassesofpointsarewellseparated,butinonedimension,thisisnotso.Ingeneral,discretizingeachattributeseparatelyoftenguaranteessuboptimalresults.Second,fiveintervalsworkbetterthanthree,butsixintervalsdonotimprovethediscretizationmuch,atleastintermsofentropy.(Entropyvaluesandresultsforsixintervalsarenotshown.)Consequently,itisdesirabletohaveastoppingcriterionthatautomaticallyfindstherightnumberofpartitions.
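To make the entropy computation concrete, the following sketch implements the interval entropy e_i and the weighted total entropy e defined above for a hypothetical partition given as per-interval class counts.

    import numpy as np

    def interval_entropy(class_counts):
        """Entropy e_i = -sum_j p_ij log2 p_ij for one interval."""
        counts = np.asarray(class_counts, dtype=float)
        p = counts[counts > 0] / counts.sum()
        return -np.sum(p * np.log2(p))

    def partition_entropy(intervals):
        """Weighted total entropy e = sum_i w_i e_i of a discretization."""
        sizes = np.array([sum(c) for c in intervals], dtype=float)
        weights = sizes / sizes.sum()
        return float(np.sum(weights * [interval_entropy(c) for c in intervals]))

    # Hypothetical partition: three intervals with counts for two class labels.
    # The perfectly pure first interval contributes nothing to the total.
    print(partition_entropy([(10, 0), (8, 2), (1, 9)]))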

CategoricalAttributeswithTooManyValuesCategoricalattributescansometimeshavetoomanyvalues.Ifthecategoricalattributeisanordinalattribute,thentechniquessimilartothoseforcontinuousattributescanbeusedtoreducethenumberofcategories.Ifthecategoricalattributeisnominal,however,thenotherapproachesareneeded.Considera

universitythathasalargenumberofdepartments.Consequently,adepartmentnameattributemighthavedozensofdifferentvalues.Inthissituation,wecoulduseourknowledgeoftherelationshipsamongdifferentdepartmentstocombinedepartmentsintolargergroups,suchasengineering,socialsciences,orbiologicalsciences.Ifdomainknowledgedoesnotserveasausefulguideorsuchanapproachresultsinpoorclassificationperformance,thenitisnecessarytouseamoreempiricalapproach,suchasgroupingvaluestogetheronlyifsuchagroupingresultsinimprovedclassificationaccuracyorachievessomeotherdataminingobjective.

2.3.7VariableTransformation

Avariabletransformationreferstoatransformationthatisappliedtoallthevaluesofavariable.(Weusethetermvariableinsteadofattributetoadheretocommonusage,althoughwewillalsorefertoattributetransformationonoccasion.)Inotherwords,foreachobject,thetransformationisappliedtothevalueofthevariableforthatobject.Forexample,ifonlythemagnitudeofavariableisimportant,thenthevaluesofthevariablecanbetransformedbytakingtheabsolutevalue.Inthefollowingsection,wediscusstwoimportanttypesofvariabletransformations:simplefunctionaltransformationsandnormalization.

Simple Functions

For this type of variable transformation, a simple mathematical function is applied to each value individually. If x is a variable, then examples of such transformations include x^k, log x, e^x, sqrt(x), 1/x, sin x, or |x|. In statistics, variable transformations, especially sqrt, log, and 1/x, are often used to transform data that does not have a Gaussian (normal) distribution into data that does. While this can be important, other reasons often take precedence in data mining. Suppose the variable of interest is the number of data bytes in a session, and the number of bytes ranges from 1 to 1 billion. This is a huge range, and it can be advantageous to compress it by using a log_10 transformation. In this case, sessions that transferred 10^8 and 10^9 bytes would be more similar to each other than sessions that transferred 10 and 1000 bytes (9 − 8 = 1 versus 3 − 1 = 3). For some applications, such as network intrusion detection, this may be what is desired, since the first two sessions most likely represent transfers of large files, while the latter two sessions could be two quite distinct types of sessions.

Variable transformations should be applied with caution because they change the nature of the data. While this is what is desired, there can be problems if the nature of the transformation is not fully appreciated. For instance, the transformation 1/x reduces the magnitude of values that are 1 or larger, but increases the magnitude of values between 0 and 1. To illustrate, the values {1, 2, 3} go to {1, 1/2, 1/3}, but the values {1, 1/2, 1/3} go to {1, 2, 3}. Thus, for all sets of values, the transformation 1/x reverses the order. To help clarify the effect of a transformation, it is important to ask questions such as the following: What is the desired property of the transformed attribute? Does the order need to be maintained? Does the transformation apply to all values, especially negative values and 0? What is the effect of the transformation on the values between 0 and 1? Exercise 17 on page 109 explores other aspects of variable transformation.

Normalization or Standardization

The goal of standardization or normalization is to make an entire set of values have a particular property. A traditional example is that of "standardizing a variable" in statistics. If x̄ is the mean (average) of the attribute values and s_x is their standard deviation, then the transformation x′ = (x − x̄)/s_x creates a new

variablethathasameanof0andastandarddeviationof1.Ifdifferentvariablesaretobeusedtogether,e.g.,forclustering,thensuchatransformationisoftennecessarytoavoidhavingavariablewithlargevaluesdominatetheresultsoftheanalysis.Toillustrate,considercomparingpeoplebasedontwovariables:ageandincome.Foranytwopeople,thedifferenceinincomewilllikelybemuchhigherinabsoluteterms(hundredsorthousandsofdollars)thanthedifferenceinage(lessthan150).Ifthedifferencesintherangeofvaluesofageandincomearenottakenintoaccount,thenthecomparisonbetweenpeoplewillbedominatedbydifferencesinincome.Inparticular,ifthesimilarityordissimilarityoftwopeopleiscalculatedusingthesimilarityordissimilaritymeasuresdefinedlaterinthischapter,theninmanycases,suchasthatofEuclideandistance,theincomevalueswilldominatethecalculation.

The mean and standard deviation are strongly affected by outliers, so the above transformation is often modified. First, the mean is replaced by the median, i.e., the middle value. Second, the standard deviation is replaced by the absolute standard deviation. Specifically, if x is a variable, then the absolute standard deviation of x is given by σ_A = ∑_{i=1}^{m} |x_i − μ|, where x_i is the i-th value of the variable, m is the number of objects, and μ is either the mean or median. Other approaches for computing estimates of the location (center) and spread of a set of values in the presence of outliers are described in statistics books. These more robust measures can also be used to define a standardization transformation.
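A small sketch of both variants is given below; the robust version centers on the median and, as one common variant of the absolute deviation defined above, averages |x_i − median| over the objects (the data are invented, with a single extreme outlier).

    import numpy as np

    rng = np.random.default_rng(0)
    income = np.append(rng.normal(50_000, 10_000, 99), 5_000_000)   # one outlier

    # Standard z-score: mean 0 and standard deviation 1 (sensitive to the outlier).
    z = (income - income.mean()) / income.std(ddof=1)

    # More robust variant: center with the median, scale with the
    # average absolute deviation from the median.
    median = np.median(income)
    abs_dev = np.sum(np.abs(income - median)) / income.size
    robust = (income - median) / abs_dev

    print(z[:3].round(2), robust[:3].round(2))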

2.4MeasuresofSimilarityandDissimilaritySimilarityanddissimilarityareimportantbecausetheyareusedbyanumberofdataminingtechniques,suchasclustering,nearestneighborclassification,andanomalydetection.Inmanycases,theinitialdatasetisnotneededoncethesesimilaritiesordissimilaritieshavebeencomputed.Suchapproachescanbeviewedastransformingthedatatoasimilarity(dissimilarity)spaceandthenperformingtheanalysis.Indeed,kernelmethodsareapowerfulrealizationofthisidea.ThesemethodsareintroducedinSection2.4.7 andarediscussedmorefullyinthecontextofclassificationinSection4.9.4.

Webeginwithadiscussionofthebasics:high-leveldefinitionsofsimilarityanddissimilarity,andadiscussionofhowtheyarerelated.Forconvenience,thetermproximityisusedtorefertoeithersimilarityordissimilarity.Sincetheproximitybetweentwoobjectsisafunctionoftheproximitybetweenthecorrespondingattributesofthetwoobjects,wefirstdescribehowtomeasuretheproximitybetweenobjectshavingonlyoneattribute.

Wethenconsiderproximitymeasuresforobjectswithmultipleattributes.ThisincludesmeasuressuchastheJaccardandcosinesimilaritymeasures,whichareusefulforsparsedata,suchasdocuments,aswellascorrelationandEuclideandistance,whichareusefulfornon-sparse(dense)data,suchastimeseriesormulti-dimensionalpoints.Wealsoconsidermutualinformation,whichcanbeappliedtomanytypesofdataandisgoodfordetectingnonlinearrelationships.Inthisdiscussion,werestrictourselvestoobjectswithrelativelyhomogeneousattributetypes,typicallybinaryorcontinuous.

Next,weconsiderseveralimportantissuesconcerningproximitymeasures.Thisincludeshowtocomputeproximitybetweenobjectswhentheyhaveheterogeneoustypesofattributes,andapproachestoaccountfordifferencesofscaleandcorrelationamongvariableswhencomputingdistancebetweennumericalobjects.Thesectionconcludeswithabriefdiscussionofhowtoselecttherightproximitymeasure.

Althoughthissectionfocusesonthecomputationofproximitybetweendataobjects,proximitycanalsobecomputedbetweenattributes.Forexample,forthedocument-termmatrixofFigure2.2(d) ,thecosinemeasurecanbeusedtocomputesimilaritybetweenapairofdocumentsorapairofterms(words).Knowingthattwovariablesarestronglyrelatedcan,forexample,behelpfulforeliminatingredundancy.Inparticular,thecorrelationandmutualinformationmeasuresdiscussedlaterareoftenusedforthatpurpose.

2.4.1Basics

DefinitionsInformally,thesimilaritybetweentwoobjectsisanumericalmeasureofthedegreetowhichthetwoobjectsarealike.Consequently,similaritiesarehigherforpairsofobjectsthataremorealike.Similaritiesareusuallynon-negativeandareoftenbetween0(nosimilarity)and1(completesimilarity).

Thedissimilaritybetweentwoobjectsisanumericalmeasureofthedegreetowhichthetwoobjectsaredifferent.Dissimilaritiesarelowerformoresimilarpairsofobjects.Frequently,thetermdistanceisusedasasynonymfordissimilarity,although,asweshallsee,distanceoftenreferstoaspecialclass

ofdissimilarities.Dissimilaritiessometimesfallintheinterval[0,1],butitisalsocommonforthemtorangefrom0to∞.

TransformationsTransformationsareoftenappliedtoconvertasimilaritytoadissimilarity,orviceversa,ortotransformaproximitymeasuretofallwithinaparticularrange,suchas[0,1].Forinstance,wemayhavesimilaritiesthatrangefrom1to10,buttheparticularalgorithmorsoftwarepackagethatwewanttousemaybedesignedtoworkonlywithdissimilarities,oritmayworkonlywithsimilaritiesintheinterval[0,1].Wediscusstheseissuesherebecausewewillemploysuchtransformationslaterinourdiscussionofproximity.Inaddition,theseissuesarerelativelyindependentofthedetailsofspecificproximitymeasures.

Frequently, proximity measures, especially similarities, are defined or transformed to have values in the interval [0, 1]. Informally, the motivation for this is to use a scale in which a proximity value indicates the fraction of similarity (or dissimilarity) between two objects. Such a transformation is often relatively straightforward. For example, if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can make them fall within the range [0, 1] by using the transformation s′ = (s − 1)/9, where s and s′ are the original and new similarity values, respectively. In the more general case, the transformation of similarities to the interval [0, 1] is given by the expression s′ = (s − min_s)/(max_s − min_s), where max_s and min_s are the maximum and minimum similarity values, respectively. Likewise, dissimilarity measures with a finite range can be mapped to the interval [0, 1] by using the formula d′ = (d − min_d)/(max_d − min_d). This is an example of a linear transformation, which preserves the relative distances between points. In other words, if points x1 and x2 are twice as far apart as points x3 and x4, the same will be true after a linear transformation.

However, there can be complications in mapping proximity measures to the interval [0, 1] using a linear transformation. If, for example, the proximity measure originally takes values in the interval [0, ∞], then max_d is not defined and a nonlinear transformation is needed. Values will not have the same relationship to one another on the new scale. Consider the transformation d′ = d/(1 + d) for a dissimilarity measure that ranges from 0 to ∞. The dissimilarities 0, 0.5, 2, 10, 100, and 1000 will be transformed into the new dissimilarities 0, 0.33, 0.67, 0.90, 0.99, and 0.999, respectively. Larger values on the original dissimilarity scale are compressed into the range of values near 1, but whether this is desirable depends on the application.

Note that mapping proximity measures to the interval [0, 1] can also change the meaning of the proximity measure. For example, correlation, which is discussed later, is a measure of similarity that takes values in the interval [−1, 1]. Mapping these values to the interval [0, 1] by taking the absolute value loses information about the sign, which can be important in some applications. See Exercise 22 on page 111.

Transforming similarities to dissimilarities and vice versa is also relatively straightforward, although we again face the issues of preserving meaning and changing a linear scale into a nonlinear scale. If the similarity (or dissimilarity) falls in the interval [0, 1], then the dissimilarity can be defined as d = 1 − s (s = 1 − d). Another simple approach is to define similarity as the negative of the dissimilarity (or vice versa). To illustrate, the dissimilarities 0, 1, 10, and 100 can be transformed into the similarities 0, −1, −10, and −100, respectively.

The similarities resulting from the negation transformation are not restricted to the range [0, 1], but if that is desired, then transformations such as s = 1/(d + 1), s = e^{−d}, or s = 1 − (d − min_d)/(max_d − min_d) can be used. For the transformation s = 1/(d + 1), the dissimilarities 0, 1, 10, 100 are transformed into 1, 0.5, 0.09, 0.01, respectively. For s = e^{−d}, they become 1.00, 0.37, 0.00, 0.00, respectively, while for s = 1 − (d − min_d)/(max_d − min_d), they become 1.00, 0.99, 0.90, 0.00, respectively. In this discussion, we have focused on converting dissimilarities to similarities. Conversion in the opposite direction is considered in Exercise 23 on page 111.

Ingeneral,anymonotonicdecreasingfunctioncanbeusedtoconvertdissimilaritiestosimilarities,orviceversa.Ofcourse,otherfactorsalsomustbeconsideredwhentransformingsimilaritiestodissimilarities,orviceversa,orwhentransformingthevaluesofaproximitymeasuretoanewscale.Wehavementionedissuesrelatedtopreservingmeaning,distortionofscale,andrequirementsofdataanalysistools,butthislistiscertainlynotexhaustive.
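The sketch below applies several of the transformations discussed above to the dissimilarities used in the examples; it reproduces, for instance, the compressed values 0, 0.33, 0.67, 0.90, 0.99, 0.999 produced by d/(1 + d).

    import numpy as np

    d = np.array([0.0, 0.5, 2.0, 10.0, 100.0, 1000.0])   # dissimilarities in [0, inf)

    nonlinear = d / (1 + d)        # maps [0, inf) into [0, 1), compressing large values
    neg = -d                       # similarity as the negative of the dissimilarity
    recip = 1 / (d + 1)            # similarity in (0, 1]
    expo = np.exp(-d)              # similarity in (0, 1]
    linear = 1 - (d - d.min()) / (d.max() - d.min())     # requires a finite range

    print(np.round(nonlinear, 3))
    print(np.round(recip, 2), np.round(expo, 2), np.round(linear, 2))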

2.4.2SimilarityandDissimilaritybetweenSimpleAttributes

Theproximityofobjectswithanumberofattributesistypicallydefinedbycombiningtheproximitiesofindividualattributes,andthus,wefirstdiscussproximitybetweenobjectshavingasingleattribute.Considerobjectsdescribedbyonenominalattribute.Whatwoulditmeanfortwosuchobjectstobesimilar?Becausenominalattributesconveyonlyinformationaboutthedistinctnessofobjects,allwecansayisthattwoobjectseitherhavethesamevalueortheydonot.Hence,inthiscasesimilarityistraditionallydefinedas1ifattributevaluesmatch,andas0otherwise.Adissimilaritywouldbedefinedintheoppositeway:0iftheattributevaluesmatch,and1iftheydonot.

For objects with a single ordinal attribute, the situation is more complicated because information about order should be taken into account. Consider an attribute that measures the quality of a product, e.g., a candy bar, on the scale {poor, fair, OK, good, wonderful}. It would seem reasonable that a product, P1, which is rated wonderful, would be closer to a product P2, which is rated good, than it would be to a product P3, which is rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped to successive integers, beginning at 0 or 1, e.g., {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}. Then d(P1, P2) = 3 − 2 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (3 − 2)/4 = 0.25. A similarity for ordinal attributes can then be defined as s = 1 − d.

This definition of similarity (dissimilarity) for an ordinal attribute should make the reader a bit uneasy since this assumes equal intervals between successive values of the attribute, and this is not necessarily so. Otherwise, we would have an interval or ratio attribute. Is the difference between the values fair and good really the same as that between the values OK and wonderful? Probably not, but in practice, our options are limited, and in the absence of more information, this is the standard approach for defining proximity between ordinal attributes.

For interval or ratio attributes, the natural measure of dissimilarity between two objects is the absolute difference of their values. For example, we might compare our current weight and our weight a year ago by saying "I am ten pounds heavier." In cases such as these, the dissimilarities typically range from 0 to ∞, rather than from 0 to 1. The similarity of interval or ratio attributes is typically expressed by transforming a dissimilarity into a similarity, as previously described.

Table 2.7 summarizes this discussion. In this table, x and y are two objects that have one attribute of the indicated type. Also, d(x, y) and s(x, y) are the dissimilarity and similarity between x and y, respectively. Other approaches are possible; these are the most common ones.

Table 2.7. Similarity and dissimilarity for simple attributes.

Nominal: d = 0 if x = y, d = 1 if x ≠ y; s = 1 if x = y, s = 0 if x ≠ y.
Ordinal: d = |x − y|/(n − 1) (values mapped to integers 0 to n − 1, where n is the number of values); s = 1 − d.
Interval or Ratio: d = |x − y|; s = −d, s = 1/(1 + d), s = e^{−d}, or s = 1 − (d − min_d)/(max_d − min_d).

The following two sections consider more complicated measures of proximity between objects that involve multiple attributes: (1) dissimilarities between data objects and (2) similarities between data objects. This division allows us to more naturally display the underlying motivations for employing various proximity measures. We emphasize, however, that similarities can be transformed into dissimilarities and vice versa using the approaches described earlier.

2.4.3 Dissimilarities between Data Objects

In this section, we discuss various kinds of dissimilarities. We begin with a discussion of distances, which are dissimilarities with certain properties, and then provide examples of more general kinds of dissimilarities.

Distances

We first present some examples, and then offer a more formal description of distances in terms of the properties common to all distances. The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by the following familiar formula:

d(x, y) = sqrt( ∑_{k=1}^{n} (x_k − y_k)^2 ),   (2.1)

where n is the number of dimensions and x_k and y_k are, respectively, the k-th attributes (components) of x and y. We illustrate this formula with Figure 2.15 and Tables 2.8 and 2.9, which show a set of points, the x and y coordinates of these points, and the distance matrix containing the pairwise distances of these points.

Figure 2.15. Four two-dimensional points.

The Euclidean distance measure given in Equation 2.1 is generalized by the Minkowski distance metric shown in Equation 2.2,

d(x, y) = ( ∑_{k=1}^{n} |x_k − y_k|^r )^{1/r},   (2.2)

where r is a parameter. The following are the three most common examples of Minkowski distances.

r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is the number of bits that is different between two objects that have only binary attributes, i.e., between two binary vectors.

r = 2. Euclidean distance (L2 norm).

r = ∞. Supremum (L_max or L_∞ norm) distance. This is the maximum difference between any attribute of the objects. More formally, the L_∞ distance is defined by Equation 2.3:

d(x, y) = lim_{r→∞} ( ∑_{k=1}^{n} |x_k − y_k|^r )^{1/r}.   (2.3)

The r parameter should not be confused with the number of dimensions (attributes) n. The Euclidean, Manhattan, and supremum distances are defined for all values of n: 1, 2, 3, ..., and specify different ways of combining the differences in each dimension (attribute) into an overall distance.

Tables 2.10 and 2.11, respectively, give the proximity matrices for the L1 and L_∞ distances using data from Table 2.8. Notice that all these distance matrices are symmetric; i.e., the ij-th entry is the same as the ji-th entry. In Table 2.9, for instance, the fourth row of the first column and the fourth column of the first row both contain the value 5.1.

Table 2.8. x and y coordinates of four points.

point  x coordinate  y coordinate
p1     0             2
p2     2             0
p3     3             1
p4     5             1

Table2.9.EuclideandistancematrixforTable2.8 .

p1 p2 p3 p4

p1 0.0 2.8 3.2 5.1

p2 2.8 0.0 1.4 3.2

p3 3.2 1.4 0.0 2.0

p4 5.1 3.2 2.0 0.0

Table 2.10. L1 distance matrix for Table 2.8.

L1  p1  p2  p3  p4
p1  0.0 4.0 4.0 6.0
p2  4.0 0.0 2.0 4.0
p3  4.0 2.0 0.0 2.0
p4  6.0 4.0 2.0 0.0

Table 2.11. L_∞ distance matrix for Table 2.8.

L_∞ p1  p2  p3  p4
p1  0.0 2.0 3.0 5.0
p2  2.0 0.0 1.0 3.0
p3  3.0 1.0 0.0 2.0
p4  5.0 3.0 2.0 0.0
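The following sketch recomputes the three distance matrices for the four points of Table 2.8 directly from the Minkowski definition with r = 1, 2, and ∞; the Euclidean matrix matches Table 2.9.

    import numpy as np

    # The four points of Table 2.8.
    points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)
    diff = np.abs(points[:, None, :] - points[None, :, :])   # pairwise |x_k - y_k|

    L1 = diff.sum(axis=2)                       # city block (Table 2.10)
    L2 = np.sqrt((diff ** 2).sum(axis=2))       # Euclidean (Table 2.9)
    Linf = diff.max(axis=2)                     # supremum (Table 2.11)

    print(np.round(L2, 1))                      # e.g., d(p1, p4) = 5.1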

Distances,suchastheEuclideandistance,havesomewell-knownproperties.Ifd(x,y)isthedistancebetweentwopoints,xandy,thenthefollowingpropertieshold.

1. Positivity
   (a) d(x, y) ≥ 0 for all x and y,
   (b) d(x, y) = 0 only if x = y.
2. Symmetry
   d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality
   d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

Measures that satisfy all three properties are known as metrics. Some people use the term distance only for dissimilarity measures that satisfy these properties, but that practice is often violated. The three properties described here are useful, as well as mathematically pleasing. Also, if the triangle inequality holds, then this property can be used to increase the efficiency of techniques (including clustering) that depend on distances possessing this property. (See Exercise 25.) Nonetheless, many dissimilarities do not satisfy one or more of the metric properties. Example 2.14 illustrates such a measure.

Example 2.14 (Non-metric Dissimilarities: Set Differences). This example is based on the notion of the difference of two sets, as defined in set theory. Given two sets A and B, A − B is the set of elements of A that are not in B. For example, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then A − B = {1} and B − A = ∅, the empty set. We can define the distance d between two sets A and B as d(A, B) = size(A − B), where size is a function returning the number of elements in a set. This distance measure, which is an integer value greater than or equal to 0, does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality. However, these properties can be made to hold if the dissimilarity measure is modified as follows: d(A, B) = size(A − B) + size(B − A). See Exercise 21 on page 110.

2.4.4SimilaritiesbetweenDataObjects

Forsimilarities,thetriangleinequality(ortheanalogousproperty)typicallydoesnothold,butsymmetryandpositivitytypicallydo.Tobeexplicit,ifs(x,y)isthesimilaritybetweenpointsxandy,thenthetypicalpropertiesofsimilaritiesarethefollowing:

1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

Thereisnogeneralanalogofthetriangleinequalityforsimilaritymeasures.Itissometimespossible,however,toshowthatasimilaritymeasurecaneasilybeconvertedtoametricdistance.ThecosineandJaccardsimilaritymeasures,whicharediscussedshortly,aretwoexamples.Also,forspecificsimilaritymeasures,itispossibletoderivemathematicalboundsonthesimilaritybetweentwoobjectsthataresimilarinspirittothetriangleinequality.

Example2.15(ANon-symmetricSimilarity


Measure). Consider an experiment in which people are asked to classify a small set of characters as they flash on a screen. The confusion matrix for this experiment records how often each character is classified as itself, and how often each is classified as another character. Using the confusion matrix, we can define a similarity measure between a character x and a character y as the number of times that x is misclassified as y, but note that this measure is not symmetric. For example, suppose that "0" appeared 200 times and was classified as a "0" 160 times, but as an "o" 40 times. Likewise, suppose that "o" appeared 200 times and was classified as an "o" 170 times, but as "0" only 30 times. Then, s(0, o) = 40, but s(o, 0) = 30.

In such situations, the similarity measure can be made symmetric by setting s′(x, y) = s′(y, x) = (s(x, y) + s(y, x))/2, where s′ indicates the new similarity measure.

2.4.5ExamplesofProximityMeasures

Thissectionprovidesspecificexamplesofsomesimilarityanddissimilaritymeasures.

SimilarityMeasuresforBinaryDataSimilaritymeasuresbetweenobjectsthatcontainonlybinaryattributesarecalledsimilaritycoefficients,andtypicallyhavevaluesbetween0and1.Avalueof1indicatesthatthetwoobjectsarecompletelysimilar,whileavalueof0indicatesthattheobjectsarenotatallsimilar.Therearemanyrationalesforwhyonecoefficientisbetterthananotherinspecificinstances.


Let x and y be two objects that consist of n binary attributes. The comparison of two such objects, i.e., two binary vectors, leads to the following four quantities (frequencies):

f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1

Simple Matching Coefficient

One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as

SMC = number of matching attribute values / number of attributes = (f11 + f00) / (f01 + f10 + f11 + f00).   (2.4)

This measure counts both presences and absences equally. Consequently, the SMC could be used to find students who had answered questions similarly on a test that consisted only of true/false questions.

Jaccard Coefficient

Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix (see Section 2.1.2). If each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates that the item was purchased, while a 0 indicates that the product was not purchased. Because the number of products not purchased by any customer far outnumbers the number of products that were purchased, a similarity measure such as SMC would say that all transactions are very similar. As a result, the Jaccard coefficient is frequently used to handle objects consisting of asymmetric binary attributes. The Jaccard coefficient, which is often symbolized by J, is given by the following equation:

J = number of matching presences / number of attributes not involved in 00 matches = f11 / (f01 + f10 + f11).   (2.5)

Example 2.16 (The SMC and Jaccard Similarity Coefficients). To illustrate the difference between these two similarity measures, we calculate SMC and J for the following two binary vectors:

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

f01 = 2, the number of attributes where x was 0 and y was 1
f10 = 1, the number of attributes where x was 1 and y was 0
f00 = 7, the number of attributes where x was 0 and y was 0
f11 = 0, the number of attributes where x was 1 and y was 1

SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
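The calculation in Example 2.16 can be reproduced with a few lines of NumPy, counting the four frequencies and applying Equations 2.4 and 2.5.

    import numpy as np

    x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

    f11 = np.sum((x == 1) & (y == 1))
    f00 = np.sum((x == 0) & (y == 0))
    f10 = np.sum((x == 1) & (y == 0))
    f01 = np.sum((x == 0) & (y == 1))

    smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # Equation 2.4
    jaccard = f11 / (f01 + f10 + f11)             # Equation 2.5
    print(smc, jaccard)                           # 0.7 and 0.0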

Cosine Similarity

Documents are often represented as vectors, where each component (attribute) represents the frequency with which a particular term (word) occurs in the document. Even though documents have thousands or tens of thousands of attributes (terms), each document is sparse since it has relatively few nonzero attributes. Thus, as with transaction data, similarity should not depend on the number of shared 0 values because any two documents are likely to "not contain" many of the same words, and therefore, if 0–0 matches are counted, most documents will be highly similar to most other documents. Therefore, a similarity measure for documents needs to ignore 0–0 matches like the Jaccard measure, but also must be able to handle non-binary vectors. The cosine similarity, defined next, is one of the most common measures of document similarity. If x and y are two document vectors, then

cos(x, y) = ⟨x, y⟩ / (∥x∥ ∥y∥) = x′y / (∥x∥ ∥y∥),   (2.6)

where ′ indicates vector or matrix transpose and ⟨x, y⟩ indicates the inner product of the two vectors,

⟨x, y⟩ = ∑_{k=1}^{n} x_k y_k = x′y,   (2.7)

and ∥x∥ is the length of vector x, ∥x∥ = sqrt( ∑_{k=1}^{n} x_k^2 ) = sqrt(⟨x, x⟩) = sqrt(x′x).

The inner product of two vectors works well for asymmetric attributes since it depends only on components that are non-zero in both vectors. Hence, the similarity between two documents depends only upon the words that appear in both of them.

Example 2.17 (Cosine Similarity between Two Document Vectors). This example calculates the cosine similarity for the following two data objects, which might represent document vectors:

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

⟨x, y⟩ = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
∥x∥ = sqrt(3×3 + 2×2 + 0×0 + 5×5 + 0×0 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0) = 6.48
∥y∥ = sqrt(1×1 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×1 + 0×0 + 2×2) = 2.45
cos(x, y) = 0.31

As indicated by Figure 2.16, cosine similarity really is a measure of the (cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for length. If the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any terms (words).

Figure2.16.Geometricillustrationofthecosinemeasure.

Equation2.6 alsocanbewrittenasEquation2.8 .

where and Dividingxandybytheirlengthsnormalizesthemtohavealengthof1.Thismeansthatcosinesimilaritydoesnottakethelengthofthetwodataobjectsintoaccountwhencomputingsimilarity.(Euclideandistancemightbeabetterchoicewhenlengthisimportant.)Forvectorswithalengthof1,thecosinemeasurecanbecalculatedbytakingasimpleinnerproduct.Consequently,whenmanycosinesimilaritiesbetweenobjectsarebeingcomputed,normalizingtheobjectstohaveunitlengthcanreducethetimerequired.
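As a quick illustration, the following Python sketch (an addition, not from the text) recomputes Example 2.17 directly from Equations 2.6 and 2.7; only the standard math module is assumed.

# Sketch: cosine similarity between two document vectors (Example 2.17)
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))        # <x, y>, Equation 2.7
    len_x = math.sqrt(sum(a * a for a in x))      # ||x||
    len_y = math.sqrt(sum(b * b for b in y))      # ||y||
    return dot / (len_x * len_y)                  # Equation 2.6

x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(x, y), 2))   # 0.31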

Extended Jaccard Coefficient (Tanimoto Coefficient)

The extended Jaccard coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes. This coefficient, which we shall represent as EJ, is defined by the following equation:

EJ(x, y) = ⟨x, y⟩ / (∥x∥² + ∥y∥² − ⟨x, y⟩) = x′y / (∥x∥² + ∥y∥² − x′y).   (2.9)

Correlation

Correlation is frequently used to measure the linear relationship between two sets of values that are observed together. Thus, correlation can measure the relationship between two variables (height and weight) or between two objects (a pair of temperature time series). Correlation is used much more frequently to measure the similarity between attributes, since the values in two data objects come from different attributes, which can have very different attribute types and scales. There are many types of correlation, and indeed correlation is sometimes used in a general sense to mean the relationship between two sets of values that are observed together. In this discussion, we will focus on a measure appropriate for numerical values.

Specifically, Pearson's correlation between two sets of numerical values, i.e., two vectors, x and y, is defined by the following equation:

corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y)) = s_xy / (s_x s_y),   (2.10)

where we use the following standard statistical notation and definitions:

covariance(x, y) = s_xy = (1/(n−1)) ∑_{k=1}^n (x_k − x̄)(y_k − ȳ)   (2.11)

standard_deviation(x) = s_x = √( (1/(n−1)) ∑_{k=1}^n (x_k − x̄)² )

standard_deviation(y) = s_y = √( (1/(n−1)) ∑_{k=1}^n (y_k − ȳ)² )

x̄ = (1/n) ∑_{k=1}^n x_k is the mean of x

ȳ = (1/n) ∑_{k=1}^n y_k is the mean of y

Example 2.18 (Perfect Correlation). Correlation is always in the range −1 to 1. A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship; that is, x_k = a y_k + b, where a and b are constants. The following two vectors x and y illustrate cases where the correlation is −1 and +1, respectively. In the first case, the means of x and y were chosen to be 0, for simplicity.

x = (−3, 6, 0, 3, −6), y = (1, −2, 0, −1, 2), corr(x, y) = −1, x_k = −3 y_k
x = (3, 6, 0, 3, 6), y = (1, 2, 0, 1, 2), corr(x, y) = 1, x_k = 3 y_k

Example 2.19 (Nonlinear Relationships). If the correlation is 0, then there is no linear relationship between the two sets of values. However, nonlinear relationships can still exist. In the following example, y_k = x_k², but their correlation is 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

Example 2.20 (Visualizing Correlation). It is also easy to judge the correlation between two vectors x and y by plotting pairs of corresponding values of x and y in a scatter plot. Figure 2.17 shows a number of these scatter plots when x and y consist of a set of 30 pairs of values that are randomly generated (with a normal distribution) so that the correlation of x and y ranges from −1 to 1. Each circle in a plot represents one of the 30 pairs of x and y values; its x coordinate is the value of that pair for x, while its y coordinate is the value of the same pair for y.

Figure 2.17. Scatter plots illustrating correlations from −1 to 1.

If we transform x and y by subtracting off their means and then normalizing them so that their lengths are 1, then their correlation can be calculated by taking the dot product. Let us refer to these transformed vectors of x and y as x′ and y′, respectively. (Notice that this transformation is not the same as the standardization used in other contexts, where we subtract the means and divide by the standard deviations, as discussed in Section 2.3.7.) This transformation highlights an interesting relationship between the correlation measure and the cosine measure. Specifically, the correlation between x and y is identical to the cosine between x′ and y′. However, the cosine between x and y is not the same as the cosine between x′ and y′, even though they both have the same correlation measure. In general, the correlation between two vectors is equal to the cosine measure only in the special case when the means of the two vectors are 0.
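The next Python sketch (illustrative only, not from the text) computes Pearson's correlation as in Equations 2.10 and 2.11, checks the two perfectly correlated cases of Example 2.18, and verifies the relationship noted above, namely that correlation equals the cosine of the mean-centered vectors.

# Sketch: Pearson's correlation (Equations 2.10-2.11) and its link to cosine
import math

def mean(v):
    return sum(v) / len(v)

def correlation(x, y):
    n = len(x)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)   # covariance
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))          # std. dev. of x
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))          # std. dev. of y
    return sxy / (sx * sy)

def centered_cosine(x, y):
    # cosine of the mean-centered vectors x' and y'
    mx, my = mean(x), mean(y)
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    num = sum(a * b for a, b in zip(xc, yc))
    den = math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc))
    return num / den

x1, y1 = [-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]
x2, y2 = [3, 6, 0, 3, 6], [1, 2, 0, 1, 2]
print(correlation(x1, y1), correlation(x2, y2))          # approximately -1 and +1 (Example 2.18)
print(correlation(x2, y2) - centered_cosine(x2, y2))     # approximately 0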

Differences Among Measures for Continuous Attributes

In this section, we illustrate the differences among the three proximity measures for continuous attributes that we have just defined: cosine, correlation, and Minkowski distance. Specifically, we consider two types of data transformations that are commonly used, namely, scaling (multiplication) by a constant factor and translation (addition) by a constant value. A proximity measure is considered to be invariant to a data transformation if its value remains unchanged even after performing the transformation. Table 2.12 compares the behavior of the cosine, correlation, and Minkowski distance measures regarding their invariance to scaling and translation operations. It can be seen that while correlation is invariant to both scaling and translation, cosine is only invariant to scaling but not to translation. Minkowski distance measures, on the other hand, are sensitive to both scaling and translation and are thus invariant to neither.

Table 2.12. Properties of cosine, correlation, and Minkowski distance measures.

Property                                  Cosine    Correlation    Minkowski Distance
Invariant to scaling (multiplication)     Yes       Yes            No
Invariant to translation (addition)       No        Yes            No

Let us consider an example to demonstrate the significance of these differences among different proximity measures.

Example 2.21 (Comparing Proximity Measures). Consider the following two vectors x and y with seven numeric attributes:

x = (1, 2, 4, 3, 0, 0, 0)
y = (1, 2, 3, 4, 0, 0, 0)

It can be seen that both x and y have 4 non-zero values, and the values in the two vectors are mostly the same, except for the third and the fourth components. The cosine, correlation, and Euclidean distance between the two vectors can be computed as follows:

cos(x, y) = 29 / (√30 × √30) = 0.9667
correlation(x, y) = 2.3571 / (1.5811 × 1.5811) = 0.9429
Euclidean distance(x, y) = ∥x − y∥ = 1.4142

Not surprisingly, x and y have a cosine and correlation measure close to 1, while the Euclidean distance between them is small, indicating that they are quite similar. Now let us consider the vector ys, which is a scaled version of y (multiplied by a constant factor of 2), and the vector yt, which is constructed by translating y by 5 units, as follows:

ys = 2 × y = (2, 4, 6, 8, 0, 0, 0)
yt = y + 5 = (6, 7, 8, 9, 5, 5, 5)

We are interested in finding whether ys and yt show the same proximity with x as shown by the original vector y. Table 2.13 shows the different measures of proximity computed for the pairs (x, y), (x, ys), and (x, yt). It can be seen that the value of correlation between x and y remains unchanged even after replacing y with ys or yt. However, the value of cosine remains equal to 0.9667 when computed for (x, y) and (x, ys), but significantly reduces to 0.7940 when computed for (x, yt). This highlights the fact that cosine is invariant to the scaling operation but not to the translation operation, in contrast with the correlation measure. The Euclidean distance, on the other hand, shows different values for all three pairs of vectors, as it is sensitive to both scaling and translation.

Table 2.13. Similarity between (x, y), (x, ys), and (x, yt).

Measure               (x, y)    (x, ys)    (x, yt)
Cosine                0.9667    0.9667     0.7940
Correlation           0.9429    0.9429     0.9429
Euclidean Distance    1.4142    5.8310     14.2127

We can observe from this example that different proximity measures behave differently when scaling or translation operations are applied on the data. The choice of the right proximity measure thus depends on the desired notion of similarity between data objects that is meaningful for a given application. For example, if x and y represented the frequencies of different words in a document-term matrix, it would be meaningful to use a proximity measure that remains unchanged when y is replaced by ys, because ys is just a scaled version of y with the same distribution of words occurring in the document. However, yt is different from y, since it contains a large number of words with non-zero frequencies that do not occur in y. Because cosine is invariant to scaling but not to translation, it will be an ideal choice of proximity measure for this application.

Consider a different scenario in which x represents a location's temperature measured on the Celsius scale for seven days. Let y, ys, and yt be the temperatures measured on those days at a different location, but using three different measurement scales. Note that different units of temperature have different offsets (e.g., Celsius and Kelvin) and different scaling factors (e.g., Celsius and Fahrenheit). It is thus desirable to use a proximity measure that captures the proximity between temperature values without being affected by the measurement scale. Correlation would then be the ideal choice of proximity measure for this application, as it is invariant to both scaling and translation.

As another example, consider a scenario where x represents the amount of precipitation (in cm) measured at seven locations. Let y, ys, and yt be estimates of the precipitation at these locations, which are predicted using three different models. Ideally, we would like to choose a model that accurately reconstructs the measurements in x without making any error. It is evident that y provides a good approximation of the values in x, whereas ys and yt provide poor estimates of precipitation, even though they do capture the trend in precipitation across locations. Hence, we need to choose a proximity measure that penalizes any difference in the model estimates from the actual observations, and is sensitive to both the scaling and translation operations. The Euclidean distance satisfies this property and thus would be the right choice of proximity measure for this application. Indeed, the Euclidean distance is commonly used in computing the accuracy of models, which will be discussed later in Chapter 3.
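The following sketch (an addition, assuming NumPy is available) checks the invariance properties summarized in Table 2.12 on the vectors of Example 2.21; it tests only whether each measure changes under scaling and translation, without relying on particular numeric values.

# Sketch: how cosine, correlation, and Euclidean distance react to scaling and translation
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def correlation(a, b):
    return np.corrcoef(a, b)[0, 1]

x  = np.array([1, 2, 4, 3, 0, 0, 0], dtype=float)
y  = np.array([1, 2, 3, 4, 0, 0, 0], dtype=float)
ys = 2 * y        # scaled version of y
yt = y + 5        # translated version of y

print(np.isclose(cosine(x, y), cosine(x, ys)))                # True: cosine is invariant to scaling
print(np.isclose(cosine(x, y), cosine(x, yt)))                # False: but not to translation
print(np.isclose(correlation(x, y), correlation(x, ys)))      # True: correlation invariant to scaling
print(np.isclose(correlation(x, y), correlation(x, yt)))      # True: and to translation
print(np.isclose(np.linalg.norm(x - y), np.linalg.norm(x - ys)))  # False: Euclidean is sensitive to both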

2.4.6 Mutual Information

Like correlation, mutual information is a measure of the relationship between two sets of paired values. It is sometimes used as an alternative to correlation, particularly when a nonlinear relationship is suspected between the pairs of values. This measure comes from information theory, which is the study of how to formally define and quantify information. Indeed, mutual information is a measure of how much information one set of values provides about another, given that the values come in pairs, e.g., height and weight. If the two sets of values are independent, i.e., the value of one tells us nothing about the other, then their mutual information is 0. On the other hand, if the two sets of values are completely dependent, i.e., knowing the value of one tells us the value of the other and vice-versa, then they have maximum mutual information. Mutual information does not have a maximum value, but we will define a normalized version of it that ranges between 0 and 1.

To define mutual information, we consider two sets of values, X and Y, which occur in pairs (X, Y). We need to measure the average information in a single set of values, i.e., either in X or in Y, and in the pairs of their values. This is commonly measured by entropy. More specifically, assume X and Y are discrete, that is, X can take m distinct values, u1, u2, …, um, and Y can take n distinct values, v1, v2, …, vn. Then their individual and joint entropy can be defined in terms of the probabilities of each value and pair of values as follows:

H(X) = −∑_{j=1}^m P(X = u_j) log2 P(X = u_j)   (2.12)

H(Y) = −∑_{k=1}^n P(Y = v_k) log2 P(Y = v_k)   (2.13)

H(X, Y) = −∑_{j=1}^m ∑_{k=1}^n P(X = u_j, Y = v_k) log2 P(X = u_j, Y = v_k)   (2.14)

where, if the probability of a value or combination of values is 0, then 0 log2(0) is conventionally taken to be 0.

The mutual information of X and Y can now be defined straightforwardly:

I(X, Y) = H(X) + H(Y) − H(X, Y)   (2.15)

Note that H(X, Y) is symmetric, i.e., H(X, Y) = H(Y, X), and thus mutual information is also symmetric, i.e., I(X, Y) = I(Y, X).

Practically, X and Y are either the values in two attributes or two rows of the same data set. In Example 2.22, we will represent those values as two vectors x and y and calculate the probability of each value or pair of values from the frequency with which values or pairs of values occur in x, y, and (x_i, y_i), where x_i is the ith component of x and y_i is the ith component of y. Let us illustrate using a previous example.

Example 2.22 (Evaluating Nonlinear Relationships with Mutual Information). Recall Example 2.19, where y_k = x_k², but their correlation was 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

From Figure 2.18, I(x, y) = H(x) + H(y) − H(x, y) = 1.9502. Although a variety of approaches to normalize mutual information are possible (see the Bibliographic Notes), for this example we will apply one that divides the mutual information by log2(min(m, n)) and produces a result between 0 and 1. This yields a value of 1.9502 / log2(4) = 0.9751. Thus, we can see that x and y are strongly related. They are not perfectly related because, given a value of y, there is, except for y = 0, some ambiguity about the value of x. Notice that for y = −x, the normalized mutual information would be 1.

Figure 2.18. Computation of mutual information.

Table 2.14. Entropy for x.

xj     P(x = xj)    −P(x = xj) log2 P(x = xj)
−3     1/7          0.4011
−2     1/7          0.4011
−1     1/7          0.4011
0      1/7          0.4011
1      1/7          0.4011
2      1/7          0.4011
3      1/7          0.4011
H(x)                2.8074

Table 2.15. Entropy for y.

yk     P(y = yk)    −P(y = yk) log2 P(y = yk)
9      2/7          0.5164
4      2/7          0.5164
1      2/7          0.5164
0      1/7          0.4011
H(y)                1.9502

Table 2.16. Joint entropy for x and y.

xj     yk    P(x = xj, y = yk)    −P(x = xj, y = yk) log2 P(x = xj, y = yk)
−3     9     1/7                  0.4011
−2     4     1/7                  0.4011
−1     1     1/7                  0.4011
0      0     1/7                  0.4011
1      1     1/7                  0.4011
2      4     1/7                  0.4011
3      9     1/7                  0.4011
H(x, y)                           2.8074
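A compact Python sketch (an illustrative addition) reproduces the entropy and mutual information computation of Example 2.22 from value frequencies; the entropy helper and variable names are hypothetical, not textbook notation.

# Sketch: entropies (Equations 2.12-2.14) and mutual information (Equation 2.15)
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

x = [-3, -2, -1, 0, 1, 2, 3]
y = [9, 4, 1, 0, 1, 4, 9]

h_x = entropy(x)                       # 2.8074 (Table 2.14)
h_y = entropy(y)                       # 1.9502 (Table 2.15)
h_xy = entropy(list(zip(x, y)))        # 2.8074 (Table 2.16)
mi = h_x + h_y - h_xy                  # Equation 2.15: 1.9502
norm_mi = mi / math.log2(min(len(set(x)), len(set(y))))   # 0.9751
print(round(mi, 4), round(norm_mi, 4))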

2.4.7 Kernel Functions*

It is easy to understand how similarity and distance might be useful in an application such as clustering, which tries to group similar objects together. What is much less obvious is that many other data analysis tasks, including predictive modeling and dimensionality reduction, can be expressed in terms of pairwise "proximities" of data objects. More specifically, many data analysis problems can be mathematically formulated to take as input a kernel matrix, K, which can be considered a type of proximity matrix. Thus, an initial preprocessing step is used to convert the input data into a kernel matrix, which is the input to the data analysis algorithm.

More formally, if a data set has m data objects, then K is an m by m matrix. If x_i and x_j are the ith and jth data objects, respectively, then k_ij, the ijth entry of K, is computed by a kernel function:

k_ij = κ(x_i, x_j)   (2.16)

As we will see in the material that follows, the use of a kernel matrix allows both wider applicability of an algorithm to various kinds of data and an ability to model nonlinear relationships with algorithms that are designed only for detecting linear relationships.

Kernels make an algorithm data independent

If an algorithm uses a kernel matrix, then it can be used with any type of data for which a kernel function can be designed. This is illustrated by Algorithm 2.1. Although only some data analysis algorithms can be modified to use a kernel matrix as input, this approach is extremely powerful because it allows such an algorithm to be used with almost any type of data for which an appropriate kernel function can be defined. Thus, a classification algorithm can be used, for example, with record data, string data, or graph data. If an algorithm can be reformulated to use a kernel matrix, then its applicability to different types of data increases dramatically. As we will see in later chapters, many clustering, classification, and anomaly detection algorithms work only with similarities or distances, and thus, can be easily modified to work with kernels.

Algorithm 2.1 Basic kernel algorithm.
1. Read in the m data objects in the data set.
2. Compute the kernel matrix, K, by applying the kernel function, κ, to each pair of data objects.
3. Run the data analysis algorithm with K as input.
4. Return the analysis result, e.g., predicted class or cluster labels.

Mapping data into a higher dimensional data space can allow modeling of nonlinear relationships

There is yet another, equally important, aspect of kernel-based data algorithms: their ability to model nonlinear relationships with algorithms that model only linear relationships. Typically, this works by first transforming (mapping) the data from a lower dimensional data space to a higher dimensional space.

Example 2.23 (Mapping Data to a Higher Dimensional Space). Consider the relationship between two variables x and y given by the following equation, which defines an ellipse in two dimensions (Figure 2.19(a)):

4x² + 9xy + 7y² = 10   (2.17)

We can map our two-dimensional data to three dimensions by creating three new variables, u, v, and w, which are defined as follows:

u = x²
v = xy
w = y²

As a result, we can now express Equation 2.17 as a linear one:

4u + 9v + 7w = 10   (2.18)

This equation describes a plane in three dimensions. Points on the ellipse will lie on that plane, while points inside and outside the ellipse will lie on opposite sides of the plane. See Figure 2.19(b). The viewpoint of this 3D plot is along the surface of the separating plane so that the plane appears as a line.

Figure 2.19. Mapping data to a higher dimensional space: two to three dimensions.

The Kernel Trick

The approach illustrated above shows the value in mapping data to a higher dimensional space, an operation that is integral to kernel-based methods. Conceptually, we first define a function φ that maps data points x and y to data points φ(x) and φ(y) in a higher dimensional space such that the inner product ⟨φ(x), φ(y)⟩ gives the desired measure of proximity of x and y. It may seem that we have potentially sacrificed a great deal by using such an approach, because we can greatly expand the size of our data, increase the computational complexity of our analysis, and encounter problems with the curse of dimensionality by computing similarity in a high-dimensional space. However, this is not the case, since these problems can be avoided by defining a kernel function κ that can compute the same similarity value, but with the data points in the original space, i.e., κ(x, y) = ⟨φ(x), φ(y)⟩. This is known as the kernel trick. Despite the name, the kernel trick has a very solid mathematical foundation and is a remarkably powerful approach for data analysis.

Not every function of a pair of data objects satisfies the properties needed for a kernel function, but it has been possible to design many useful kernels for a wide variety of data types. For example, three common kernel functions are the polynomial, Gaussian (radial basis function (RBF)), and sigmoid kernels. If x and y are two data objects, specifically, two data vectors, then these three kernel functions can be expressed as follows, respectively:

κ(x, y) = (x′y + c)^d   (2.19)

κ(x, y) = exp(−∥x − y∥² / (2σ²))   (2.20)

κ(x, y) = tanh(αx′y + c)   (2.21)

where α and c ≥ 0 are constants, d is an integer parameter that gives the polynomial degree, ∥x − y∥ is the length of the vector x − y, and σ > 0 is a parameter that governs the "spread" of a Gaussian.

Example 2.24 (The Polynomial Kernel). Note that the kernel functions just presented compute the same similarity value as would be computed if we actually mapped the data to a higher dimensional space and then computed an inner product there. For example, for the polynomial kernel of degree 2, let φ be the function that maps a two-dimensional data vector x = (x1, x2) to the higher dimensional space. Specifically, let

φ(x) = (x1², x2², √2 x1x2, √(2c) x1, √(2c) x2, c).   (2.22)

For the higher dimensional space, let the proximity be defined as the inner product of φ(x) and φ(y), i.e., ⟨φ(x), φ(y)⟩. Then, as previously mentioned, it can be shown that

κ(x, y) = ⟨φ(x), φ(y)⟩,   (2.23)

where κ is defined by Equation 2.19 above. Specifically, if x = (x1, x2) and y = (y1, y2), then

κ(x, y) = ⟨φ(x), φ(y)⟩ = x1²y1² + x2²y2² + 2x1x2y1y2 + 2c x1y1 + 2c x2y2 + c².   (2.24)

More generally, the kernel trick depends on defining κ and φ so that Equation 2.23 holds. This has been done for a wide variety of kernels.

This discussion of kernel-based approaches was intended only to provide a brief introduction to this topic and has omitted many details. A fuller discussion of the kernel-based approach is provided in Section 4.9.4, which discusses these issues in the context of nonlinear support vector machines for classification. More general references for kernel-based analysis can be found in the Bibliographic Notes of this chapter.
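To make the kernel trick concrete, the following Python sketch (an addition, not from the text) verifies numerically that the degree-2 polynomial kernel of Equation 2.19 equals the inner product of the explicitly mapped vectors of Equation 2.22; the sample vectors are arbitrary.

# Sketch: the kernel trick for the degree-2 polynomial kernel (Equations 2.19, 2.22, 2.23)
import math

def poly_kernel(x, y, c=1.0, d=2):
    return (sum(a * b for a, b in zip(x, y)) + c) ** d        # Equation 2.19

def phi(x, c=1.0):
    # Explicit feature map for d = 2 (Equation 2.22)
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2,
            math.sqrt(2 * c) * x1, math.sqrt(2 * c) * x2, c)

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, y))         # 4.0, computed in the original two-dimensional space
print(inner(phi(x), phi(y)))     # 4.0, the same value computed in the mapped space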

2.4.8 Bregman Divergence*

This section provides a brief description of Bregman divergences, which are a family of proximity functions that share some common properties. As a result, it is possible to construct general data mining algorithms, such as clustering algorithms, that work with any Bregman divergence. A concrete example is the K-means clustering algorithm (Section 7.2). Note that this section requires knowledge of vector calculus.

Bregman divergences are loss or distortion functions. To understand the idea of a loss function, consider the following. Let x and y be two points, where y is regarded as the original point and x is some distortion or approximation of it. For example, x may be a point that was generated by adding random noise to y. The goal is to measure the resulting distortion or loss that results if y is approximated by x. Of course, the more similar x and y are, the smaller the loss or distortion. Thus, Bregman divergences can be used as dissimilarity functions.

More formally, we have the following definition.

Definition 2.6 (Bregman divergence). Given a strictly convex function ϕ (with a few modest restrictions that are generally satisfied), the Bregman divergence (loss function) D(x, y) generated by that function is given by the following equation:

D(x, y) = ϕ(x) − ϕ(y) − ⟨∇ϕ(y), (x − y)⟩   (2.25)

where ∇ϕ(y) is the gradient of ϕ evaluated at y, x − y is the vector difference between x and y, and ⟨∇ϕ(y), (x − y)⟩ is the inner product between ∇ϕ(y) and (x − y). For points in Euclidean space, the inner product is just the dot product.

D(x, y) can be written as D(x, y) = ϕ(x) − L(x), where L(x) = ϕ(y) + ⟨∇ϕ(y), (x − y)⟩ represents the equation of a plane that is tangent to the function ϕ at y. Using calculus terminology, L(x) is the linearization of ϕ around the point y, and the Bregman divergence is just the difference between a function and a linear approximation to that function. Different Bregman divergences are obtained by using different choices for ϕ.

Example 2.25. We provide a concrete example using squared Euclidean distance, but restrict ourselves to one dimension to simplify the mathematics. Let x and y be real numbers and ϕ(t) be the real-valued function, ϕ(t) = t². In that case, the gradient reduces to the derivative, and the dot product reduces to multiplication. Specifically, Equation 2.25 becomes Equation 2.26:

D(x, y) = x² − y² − 2y(x − y) = (x − y)²   (2.26)

The graph for this example, with y = 1, is shown in Figure 2.20. The Bregman divergence is shown for two values of x: x = 2 and x = 3.

Figure 2.20. Illustration of Bregman divergence.
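A minimal sketch (illustrative only) of the one-dimensional case in Example 2.25: with ϕ(t) = t², the Bregman divergence of Equation 2.25 reduces to the squared difference of Equation 2.26.

# Sketch: one-dimensional Bregman divergence with phi(t) = t^2
def phi(t):
    return t * t

def grad_phi(t):          # derivative of phi
    return 2 * t

def bregman(x, y):
    # Equation 2.25: phi(x) - phi(y) - <grad phi(y), x - y>
    return phi(x) - phi(y) - grad_phi(y) * (x - y)

y = 1.0
for x in (2.0, 3.0):
    print(bregman(x, y), (x - y) ** 2)   # the two values agree: 1.0 and 4.0 (Equation 2.26)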

2.4.9 Issues in Proximity Calculation

This section discusses several important issues related to proximity measures: (1) how to handle the case in which attributes have different scales and/or are correlated, (2) how to calculate proximity between objects that are composed of different types of attributes, e.g., quantitative and qualitative, and (3) how to handle proximity calculations when attributes have different weights, i.e., when not all attributes contribute equally to the proximity of objects.

Standardization and Correlation for Distance Measures

An important issue with distance measures is how to handle the situation when attributes do not have the same range of values. (This situation is often described by saying that "the variables have different scales.") In a previous example, Euclidean distance was used to measure the distance between people based on two attributes: age and income. Unless these two attributes are standardized, the distance between two people will be dominated by income.

A related issue is how to compute distance when there is correlation between some of the attributes, perhaps in addition to differences in the ranges of values. A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal). Correlated variables have a large impact on standard distance measures since a change in any of the correlated variables is reflected in a change in all the correlated variables. Specifically, the Mahalanobis distance between two objects (vectors) x and y is defined as

Mahalanobis(x, y) = (x − y)′ Σ⁻¹ (x − y),   (2.27)

where Σ⁻¹ is the inverse of the covariance matrix of the data. Note that the covariance matrix Σ is the matrix whose ijth entry is the covariance of the ith and jth attributes as defined by Equation 2.11.

Example 2.26. In Figure 2.21, there are 1000 points, whose x and y attributes have a correlation of 0.6. The distance between the two large points at the opposite ends of the long axis of the ellipse is 14.7 in terms of Euclidean distance, but only 6 with respect to Mahalanobis distance. This is because the Mahalanobis distance gives less emphasis to the direction of largest variance. In practice, computing the Mahalanobis distance is expensive, but can be worthwhile for data whose attributes are correlated. If the attributes are relatively uncorrelated, but have different ranges, then standardizing the variables is sufficient.

Figure 2.21. Set of two-dimensional points. The Mahalanobis distance between the two points represented by large dots is 6; their Euclidean distance is 14.7.
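The following sketch (assuming NumPy is available; not part of the original text) computes Equation 2.27 on synthetic correlated data loosely resembling Figure 2.21; the random seed and covariance values are arbitrary illustrative choices.

# Sketch: Mahalanobis distance (Equation 2.27) on correlated two-dimensional data
import numpy as np

def mahalanobis(x, y, cov_inv):
    d = x - y
    return float(d @ cov_inv @ d)        # (x - y)' Sigma^{-1} (x - y)

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])                 # correlation of 0.6, as in Example 2.26
data = rng.multivariate_normal([0.0, 0.0], cov, size=1000)

cov_inv = np.linalg.inv(np.cov(data, rowvar=False))      # inverse covariance of the data
x, y = data[0], data[1]
print(np.linalg.norm(x - y))              # Euclidean distance between the two points
print(mahalanobis(x, y, cov_inv))         # Mahalanobis distance between the same points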

Combining Similarities for Heterogeneous Attributes

The previous definitions of similarity were based on approaches that assumed all the attributes were of the same type. A general approach is needed when the attributes are of different types. One straightforward approach is to compute the similarity between each attribute separately using Table 2.7, and then combine these similarities using a method that results in a similarity between 0 and 1. One possible approach is to define the overall similarity as the average of all the individual attribute similarities. Unfortunately, this approach does not work well if some of the attributes are asymmetric attributes. For example, if all the attributes are asymmetric binary attributes, then the similarity measure suggested previously reduces to the simple matching coefficient, a measure that is not appropriate for asymmetric binary attributes. The easiest way to fix this problem is to omit asymmetric attributes from the similarity calculation when their values are 0 for both of the objects whose similarity is being computed. A similar approach also works well for handling missing values.

In summary, Algorithm 2.2 is effective for computing an overall similarity between two objects, x and y, with different types of attributes. This procedure can be easily modified to work with dissimilarities.

Algorithm 2.2 Similarities of heterogeneous objects.
1: For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1].
2: Define an indicator variable, δ_k, for the kth attribute as follows:
   δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute
   δ_k = 1 otherwise
3: Compute the overall similarity between the two objects using the following formula:
   similarity(x, y) = ∑_{k=1}^n δ_k s_k(x, y) / ∑_{k=1}^n δ_k   (2.28)
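A small Python sketch (an illustrative addition) of Algorithm 2.2: the per-attribute similarity rule and the example attribute values are hypothetical, but the skipping of asymmetric 0-0 matches and missing values, and the averaging of Equation 2.28, follow the steps above.

# Sketch: overall similarity between heterogeneous objects (Algorithm 2.2, Equation 2.28)
def combined_similarity(x, y, asymmetric):
    """x, y: lists of attribute values; asymmetric[k] is True for asymmetric binary attributes."""
    num, den = 0.0, 0.0
    for k, (a, b) in enumerate(zip(x, y)):
        # Step 2: delta_k = 0 when the attribute is asymmetric and both values are 0,
        # or when either value is missing; such attributes are simply skipped.
        if a is None or b is None or (asymmetric[k] and a == 0 and b == 0):
            continue
        s_k = 1.0 if a == b else 0.0     # Step 1: a simple 0/1 per-attribute similarity
        num += s_k
        den += 1.0
    return num / den if den > 0 else 0.0  # Step 3: Equation 2.28

x = [1, 0, 0, "red", 1]
y = [1, 0, 1, "blue", None]
print(combined_similarity(x, y, asymmetric=[True, True, True, False, False]))  # 0.333...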

Using Weights

In much of the previous discussion, all attributes were treated equally when computing proximity. This is not desirable when some attributes are more important to the definition of proximity than others. To address these situations, the formulas for proximity can be modified by weighting the contribution of each attribute.

With attribute weights, w_k, Equation 2.28 becomes

similarity(x, y) = ∑_{k=1}^n w_k δ_k s_k(x, y) / ∑_{k=1}^n w_k δ_k.   (2.29)

The definition of the Minkowski distance can also be modified as follows:

d(x, y) = (∑_{k=1}^n w_k |x_k − y_k|^r)^{1/r}.   (2.30)

2.4.10 Selecting the Right Proximity Measure

A few general observations may be helpful. First, the type of proximity measure should fit the type of data. For many types of dense, continuous data, metric distance measures such as Euclidean distance are often used. Proximity between continuous attributes is most often expressed in terms of differences, and distance measures provide a well-defined way of combining these differences into an overall proximity measure. Although attributes can have different scales and be of differing importance, these issues can often be dealt with as described earlier, through normalization and weighting of attributes.

For sparse data, which often consists of asymmetric attributes, we typically employ similarity measures that ignore 0–0 matches. Conceptually, this reflects the fact that, for a pair of complex objects, similarity depends on the number of characteristics they both share, rather than the number of characteristics they both lack. The cosine, Jaccard, and extended Jaccard measures are appropriate for such data.

There are other characteristics of data vectors that often need to be considered. Invariance to scaling (multiplication) and to translation (addition) were previously discussed with respect to Euclidean distance and the cosine and correlation measures. The practical implications of such considerations are that, for example, cosine is more suitable for sparse document data where only scaling is important, while correlation works better for time series, where both scaling and translation are important. Euclidean distance or other types of Minkowski distance are most appropriate when two data vectors are to match as closely as possible across all components (features).

In some cases, transformation or normalization of the data is needed to obtain a proper similarity measure. For instance, time series can have trends or periodic patterns that significantly impact similarity. Also, a proper computation of similarity often requires that time lags be taken into account. Finally, two time series may be similar only over specific periods of time. For example, there is a strong relationship between temperature and the use of natural gas, but only during the heating season.

Practical considerations can also be important. Sometimes, one or more proximity measures are already in use in a particular field, and thus, others will have answered the question of which proximity measures should be used. Other times, the software package or clustering algorithm being used can drastically limit the choices. If efficiency is a concern, then we may want to choose a proximity measure that has a property, such as the triangle inequality, that can be used to reduce the number of proximity calculations. (See Exercise 25.)

However, if common practice or practical restrictions do not dictate a choice, then the proper choice of a proximity measure can be a time-consuming task that requires careful consideration of both domain knowledge and the purpose for which the measure is being used. A number of different similarity measures may need to be evaluated to see which ones produce results that make the most sense.

2.5BibliographicNotesItisessentialtounderstandthenatureofthedatathatisbeinganalyzed,andatafundamentallevel,thisisthesubjectofmeasurementtheory.Inparticular,oneoftheinitialmotivationsfordefiningtypesofattributeswastobepreciseaboutwhichstatisticaloperationswerevalidforwhatsortsofdata.WehavepresentedtheviewofmeasurementtheorythatwasinitiallydescribedinaclassicpaperbyS.S.Stevens[112].(Tables2.2 and2.3 arederivedfromthosepresentedbyStevens[113].)Whilethisisthemostcommonviewandisreasonablyeasytounderstandandapply,thereis,ofcourse,muchmoretomeasurementtheory.Anauthoritativediscussioncanbefoundinathree-volumeseriesonthefoundationsofmeasurementtheory[88,94,114].Alsoofinterestisawide-rangingarticlebyHand[77],whichdiscussesmeasurementtheoryandstatistics,andisaccompaniedbycommentsfromotherresearchersinthefield.NumerouscritiquesandextensionsoftheapproachofStevenshavebeenmade[66,97,117].Finally,manybooksandarticlesdescribemeasurementissuesforparticularareasofscienceandengineering.

Dataqualityisabroadsubjectthatspanseverydisciplinethatusesdata.Discussionsofprecision,bias,accuracy,andsignificantfigurescanbefoundinmanyintroductoryscience,engineering,andstatisticstextbooks.Theviewofdataqualityas“fitnessforuse”isexplainedinmoredetailinthebookbyRedman[103].ThoseinterestedindataqualitymayalsobeinterestedinMIT’sInformationQuality(MITIQ)Program[95,118].However,theknowledgeneededtodealwithspecificdataqualityissuesinaparticulardomainisoftenbestobtainedbyinvestigatingthedataqualitypracticesofresearchersinthatfield.

Aggregationisalesswell-definedsubjectthanmanyotherpreprocessingtasks.However,aggregationisoneofthemaintechniquesusedbythedatabaseareaofOnlineAnalyticalProcessing(OLAP)[68,76,102].Therehasalsobeenrelevantworkintheareaofsymbolicdataanalysis(BockandDiday[64]).Oneofthegoalsinthisareaistosummarizetraditionalrecorddataintermsofsymbolicdataobjectswhoseattributesaremorecomplexthantraditionalattributes.Specifically,theseattributescanhavevaluesthataresetsofvalues(categories),intervals,orsetsofvalueswithweights(histograms).Anothergoalofsymbolicdataanalysisistobeabletoperformclustering,classification,andotherkindsofdataanalysisondatathatconsistsofsymbolicdataobjects.

Samplingisasubjectthathasbeenwellstudiedinstatisticsandrelatedfields.Manyintroductorystatisticsbooks,suchastheonebyLindgren[90],havesomediscussionaboutsampling,andentirebooksaredevotedtothesubject,suchastheclassictextbyCochran[67].AsurveyofsamplingfordataminingisprovidedbyGuandLiu[74],whileasurveyofsamplingfordatabasesisprovidedbyOlkenandRotem[98].Thereareanumberofotherdatamininganddatabase-relatedsamplingreferencesthatmaybeofinterest,includingpapersbyPalmerandFaloutsos[100],Provostetal.[101],Toivonen[115],andZakietal.[119].

Instatistics,thetraditionaltechniquesthathavebeenusedfordimensionalityreductionaremultidimensionalscaling(MDS)(BorgandGroenen[65],KruskalandUslaner[89])andprincipalcomponentanalysis(PCA)(Jolliffe[80]),whichissimilartosingularvaluedecomposition(SVD)(Demmel[70]).DimensionalityreductionisdiscussedinmoredetailinAppendixB.

Discretizationisatopicthathasbeenextensivelyinvestigatedindatamining.Someclassificationalgorithmsworkonlywithcategoricaldata,andassociationanalysisrequiresbinarydata,andthus,thereisasignificant

motivationtoinvestigatehowtobestbinarizeordiscretizecontinuousattributes.Forassociationanalysis,wereferthereadertoworkbySrikantandAgrawal[111],whilesomeusefulreferencesfordiscretizationintheareaofclassificationincludeworkbyDoughertyetal.[71],ElomaaandRousu[72],FayyadandIrani[73],andHussainetal.[78].

Featureselectionisanothertopicwellinvestigatedindatamining.AbroadcoverageofthistopicisprovidedinasurveybyMolinaetal.[96]andtwobooksbyLiuandMotada[91,92].OtherusefulpapersincludethosebyBlumandLangley[63],KohaviandJohn[87],andLiuetal.[93].

Itisdifficulttoprovidereferencesforthesubjectoffeaturetransformationsbecausepracticesvaryfromonedisciplinetoanother.Manystatisticsbookshaveadiscussionoftransformations,buttypicallythediscussionisrestrictedtoaparticularpurpose,suchasensuringthenormalityofavariableormakingsurethatvariableshaveequalvariance.Weoffertworeferences:Osborne[99]andTukey[116].

Whilewehavecoveredsomeofthemostcommonlyuseddistanceandsimilaritymeasures,therearehundredsofsuchmeasuresandmorearebeingcreatedallthetime.Aswithsomanyothertopicsinthischapter,manyofthesemeasuresarespecifictoparticularfields,e.g.,intheareaoftimeseriesseepapersbyKalpakisetal.[81]andKeoghandPazzani[83].Clusteringbooksprovidethebestgeneraldiscussions.Inparticular,seethebooksbyAnderberg[62],JainandDubes[79],KaufmanandRousseeuw[82],andSneathandSokal[109].

Information-basedmeasuresofsimilarityhavebecomemorepopularlatelydespitethecomputationaldifficultiesandexpenseofcalculatingthem.AgoodintroductiontoinformationtheoryisprovidedbyCoverandThomas[69].Computingthemutualinformationforcontinuousvariablescanbe

straightforwardiftheyfollowawell-knowdistribution,suchasGaussian.However,thisisoftennotthecase,andmanytechniqueshavebeendeveloped.Asoneexample,thearticlebyKhan,etal.[85]comparesvariousmethodsinthecontextofcomparingshorttimeseries.SeealsotheinformationandmutualinformationpackagesforRandMatlab.MutualinformationhasbeenthesubjectofconsiderablerecentattentionduetopaperbyReshef,etal.[104,105]thatintroducedanalternativemeasure,albeitonebasedonmutualinformation,whichwasclaimedtohavesuperiorproperties.Althoughthisapproachhadsomeearlysupport,e.g.,[110],othershavepointedoutvariouslimitations[75,86,108].

Twopopularbooksonthetopicofkernelmethodsare[106]and[107].Thelatteralsohasawebsitewithlinkstokernel-relatedmaterials[84].Inaddition,manycurrentdatamining,machinelearning,andstatisticallearningtextbookshavesomematerialaboutkernelmethods.FurtherreferencesforkernelmethodsinthecontextofsupportvectormachineclassifiersareprovidedinthebibliographicNotesofSection4.9.4.

Bibliography[62]M.R.Anderberg.ClusterAnalysisforApplications.AcademicPress,New

York,December1973.

[63]A.BlumandP.Langley.SelectionofRelevantFeaturesandExamplesinMachineLearning.ArtificialIntelligence,97(1–2):245–271,1997.

[64]H.H.BockandE.Diday.AnalysisofSymbolicData:ExploratoryMethodsforExtractingStatisticalInformationfromComplexData(StudiesinClassification,DataAnalysis,andKnowledgeOrganization).Springer-VerlagTelos,January2000.

[65]I.BorgandP.Groenen.ModernMultidimensionalScaling—TheoryandApplications.Springer-Verlag,February1997.

[66]N.R.Chrisman.Rethinkinglevelsofmeasurementforcartography.CartographyandGeographicInformationSystems,25(4):231–242,1998.

[67]W.G.Cochran.SamplingTechniques.JohnWiley&Sons,3rdedition,July1977.

[68]E.F.Codd,S.B.Codd,andC.T.Smalley.ProvidingOLAP(On-lineAnalyticalProcessing)toUser-Analysts:AnITMandate.WhitePaper,E.F.CoddandAssociates,1993.

[69]T.M.CoverandJ.A.Thomas.Elementsofinformationtheory.JohnWiley&Sons,2012.

[70]J.W.Demmel.AppliedNumericalLinearAlgebra.SocietyforIndustrial&AppliedMathematics,September1997.

[71]J.Dougherty,R.Kohavi,andM.Sahami.SupervisedandUnsupervisedDiscretizationofContinuousFeatures.InProc.ofthe12thIntl.Conf.onMachineLearning,pages194–202,1995.

[72]T.ElomaaandJ.Rousu.GeneralandEfficientMultisplittingofNumericalAttributes.MachineLearning,36(3):201–244,1999.

[73]U.M.FayyadandK.B.Irani.Multi-intervaldiscretizationofcontinuousvaluedattributesforclassificationlearning.InProc.13thInt.JointConf.onArtificialIntelligence,pages1022–1027.MorganKaufman,1993.

[74]F.H.GaohuaGuandH.Liu.SamplingandItsApplicationinDataMining:ASurvey.TechnicalReportTRA6/00,NationalUniversityofSingapore,Singapore,2000.

[75]M.Gorfine,R.Heller,andY.Heller.CommentonDetectingnovelassociationsinlargedatasets.Unpublished(availableathttp://emotion.technion.ac.il/gorfinm/files/science6.pdfon11Nov.2012),2012.

[76]J.Gray,S.Chaudhuri,A.Bosworth,A.Layman,D.Reichart,M.Venkatrao,F.Pellow,andH.Pirahesh.DataCube:ARelationalAggregationOperatorGeneralizingGroup-By,Cross-Tab,andSub-Totals.JournalDataMiningandKnowledgeDiscovery,1(1):29–53,1997.

[77]D.J.Hand.StatisticsandtheTheoryofMeasurement.JournaloftheRoyalStatisticalSociety:SeriesA(StatisticsinSociety),159(3):445–492,1996.

[78]F.Hussain,H.Liu,C.L.Tan,andM.Dash.TRC6/99:Discretization:anenablingtechnique.Technicalreport,NationalUniversityofSingapore,Singapore,1999.

[79]A.K.JainandR.C.Dubes.AlgorithmsforClusteringData.PrenticeHallAdvancedReferenceSeries.PrenticeHall,March1988.

[80]I.T.Jolliffe.PrincipalComponentAnalysis.SpringerVerlag,2ndedition,October2002.

[81]K.Kalpakis,D.Gada,andV.Puttagunta.DistanceMeasuresforEffectiveClusteringofARIMATime-Series.InProc.ofthe2001IEEEIntl.Conf.onDataMining,pages273–280.IEEEComputerSociety,2001.

[82]L.KaufmanandP.J.Rousseeuw.FindingGroupsinData:AnIntroductiontoClusterAnalysis.WileySeriesinProbabilityandStatistics.JohnWileyandSons,NewYork,November1990.

[83]E.J.KeoghandM.J.Pazzani.Scalingupdynamictimewarpingfordataminingapplications.InKDD,pages285–289,2000.

[84]KernelMethodsforPatternAnalysisWebsite.http://www.kernel-methods.net/,2014.

[85]S.Khan,S.Bandyopadhyay,A.R.Ganguly,S.Saigal,D.J.EricksonIII,V.Protopopescu,andG.Ostrouchov.Relativeperformanceofmutualinformationestimationmethodsforquantifyingthedependenceamongshortandnoisydata.PhysicalReviewE,76(2):026209,2007.

[86]J.B.KinneyandG.S.Atwal.Equitability,mutualinformation,andthemaximalinformationcoefficient.ProceedingsoftheNationalAcademyofSciences,2014.

[87]R.KohaviandG.H.John.WrappersforFeatureSubsetSelection.ArtificialIntelligence,97(1–2):273–324,1997.

[88]D.Krantz,R.D.Luce,P.Suppes,andA.Tversky.FoundationsofMeasurements:Volume1:Additiveandpolynomialrepresentations.AcademicPress,NewYork,1971.

[89]J.B.KruskalandE.M.Uslaner.MultidimensionalScaling.SagePublications,August1978.

[90]B.W.Lindgren.StatisticalTheory.CRCPress,January1993.

[91]H.LiuandH.Motoda,editors.FeatureExtraction,ConstructionandSelection:ADataMiningPerspective.KluwerInternationalSeriesinEngineeringandComputerScience,453.KluwerAcademicPublishers,July1998.

[92]H.LiuandH.Motoda.FeatureSelectionforKnowledgeDiscoveryandDataMining.KluwerInternationalSeriesinEngineeringandComputerScience,454.KluwerAcademicPublishers,July1998.

[93]H.Liu,H.Motoda,andL.Yu.FeatureExtraction,Selection,andConstruction.InN.Ye,editor,TheHandbookofDataMining,pages22–41.LawrenceErlbaumAssociates,Inc.,Mahwah,NJ,2003.

[94]R.D.Luce,D.Krantz,P.Suppes,andA.Tversky.FoundationsofMeasurements:Volume3:Representation,Axiomatization,andInvariance.AcademicPress,NewYork,1990.

[95]MITInformationQuality(MITIQ)Program.http://mitiq.mit.edu/,2014.

[96]L.C.Molina,L.Belanche,andA.Nebot.FeatureSelectionAlgorithms:ASurveyandExperimentalEvaluation.InProc.ofthe2002IEEEIntl.Conf.onDataMining,2002.

[97]F.MostellerandJ.W.Tukey.Dataanalysisandregression:asecondcourseinstatistics.Addison-Wesley,1977.

[98]F.OlkenandD.Rotem.RandomSamplingfromDatabases—ASurvey.Statistics&Computing,5(1):25–42,March1995.

[99]J.Osborne.NotesontheUseofDataTransformations.PracticalAssessment,Research&Evaluation,28(6),2002.

[100]C.R.PalmerandC.Faloutsos.Densitybiasedsampling:Animprovedmethodfordataminingandclustering.ACMSIGMODRecord,29(2):82–92,2000.

[101]F.J.Provost,D.Jensen,andT.Oates.EfficientProgressiveSampling.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages23–32,1999.

[102]R.RamakrishnanandJ.Gehrke.DatabaseManagementSystems.McGraw-Hill,3rdedition,August2002.

[103]T.C.Redman.DataQuality:TheFieldGuide.DigitalPress,January2001.

[104]D.Reshef,Y.Reshef,M.Mitzenmacher,andP.Sabeti.Equitabilityanalysisofthemaximalinformationcoefficient,withcomparisons.arXivpreprintarXiv:1301.6314,2013.

[105]D.N.Reshef,Y.A.Reshef,H.K.Finucane,S.R.Grossman,G.McVean,P.J.Turnbaugh,E.S.Lander,M.Mitzenmacher,andP.C.

Sabeti.Detectingnovelassociationsinlargedatasets.science,334(6062):1518–1524,2011.

[106]B.SchölkopfandA.J.Smola.Learningwithkernels:supportvectormachines,regularization,optimization,andbeyond.MITpress,2002.

[107]J.Shawe-TaylorandN.Cristianini.Kernelmethodsforpatternanalysis.Cambridgeuniversitypress,2004.

[108]N.SimonandR.Tibshirani.Commenton”DetectingNovelAssociationsInLargeDataSets”byReshefEtAl,ScienceDec16,2011.arXivpreprintarXiv:1401.7645,2014.

[109]P.H.A.SneathandR.R.Sokal.NumericalTaxonomy.Freeman,SanFrancisco,1971.

[110]T.Speed.Acorrelationforthe21stcentury.Science,334(6062):1502–1503,2011.

[111]R.SrikantandR.Agrawal.MiningQuantitativeAssociationRulesinLargeRelationalTables.InProc.of1996ACM-SIGMODIntl.Conf.onManagementofData,pages1–12,Montreal,Quebec,Canada,August1996.

[112]S.S.Stevens.OntheTheoryofScalesofMeasurement.Science,103(2684):677–680,June1946.

[113]S.S.Stevens.Measurement.InG.M.Maranell,editor,Scaling:ASourcebookforBehavioralScientists,pages22–41.AldinePublishingCo.,Chicago,1974.

[114]P.Suppes,D.Krantz,R.D.Luce,andA.Tversky.FoundationsofMeasurements:Volume2:Geometrical,Threshold,andProbabilisticRepresentations.AcademicPress,NewYork,1989.

[115]H.Toivonen.SamplingLargeDatabasesforAssociationRules.InVLDB96,pages134–145.MorganKaufman,September1996.

[116]J.W.Tukey.OntheComparativeAnatomyofTransformations.AnnalsofMathematicalStatistics,28(3):602–632,September1957.

[117]P.F.VellemanandL.Wilkinson.Nominal,ordinal,interval,andratiotypologiesaremisleading.TheAmericanStatistician,47(1):65–72,1993.

[118]R.Y.Wang,M.Ziad,Y.W.Lee,andY.R.Wang.DataQuality.TheKluwerInternationalSeriesonAdvancesinDatabaseSystems,Volume23.KluwerAcademicPublishers,January2001.

[119]M.J.Zaki,S.Parthasarathy,W.Li,andM.Ogihara.EvaluationofSamplingforDataMiningofAssociationRules.TechnicalReportTR617,RensselaerPolytechnicInstitute,1996.

2.6Exercises1.IntheinitialexampleofChapter2 ,thestatisticiansays,“Yes,fields2and3arebasicallythesame.”Canyoutellfromthethreelinesofsampledatathatareshownwhyshesaysthat?

2.Classifythefollowingattributesasbinary,discrete,orcontinuous.Alsoclassifythemasqualitative(nominalorordinal)orquantitative(intervalorratio).Somecasesmayhavemorethanoneinterpretation,sobrieflyindicateyourreasoningifyouthinktheremaybesomeambiguity.

Example:Ageinyears.Answer:Discrete,quantitative,ratio

a. TimeintermsofAMorPM.

b. Brightnessasmeasuredbyalightmeter.

c. Brightnessasmeasuredbypeople’sjudgments.

d. Anglesasmeasuredindegreesbetween0and360.

e. Bronze,Silver,andGoldmedalsasawardedattheOlympics.

f. Heightabovesealevel.

g. Numberofpatientsinahospital.

h. ISBNnumbersforbooks.(LookuptheformatontheWeb.)

i. Abilitytopasslightintermsofthefollowingvalues:opaque,translucent,transparent.

j. Militaryrank.

k. Distancefromthecenterofcampus.

l. Densityofasubstanceingramspercubiccentimeter.

m. Coatchecknumber.(Whenyouattendanevent,youcanoftengiveyourcoattosomeonewho,inturn,givesyouanumberthatyoucanusetoclaimyourcoatwhenyouleave.)

3.Youareapproachedbythemarketingdirectorofalocalcompany,whobelievesthathehasdevisedafoolproofwaytomeasurecustomersatisfaction.Heexplainshisschemeasfollows:“It’ssosimplethatIcan’tbelievethatnoonehasthoughtofitbefore.Ijustkeeptrackofthenumberofcustomercomplaintsforeachproduct.Ireadinadataminingbookthatcountsareratioattributes,andso,mymeasureofproductsatisfactionmustbearatioattribute.ButwhenIratedtheproductsbasedonmynewcustomersatisfactionmeasureandshowedthemtomyboss,hetoldmethatIhadoverlookedtheobvious,andthatmymeasurewasworthless.Ithinkthathewasjustmadbecauseourbestsellingproducthadtheworstsatisfactionsinceithadthemostcomplaints.Couldyouhelpmesethimstraight?”

a. Whoisright,themarketingdirectororhisboss?Ifyouanswered,hisboss,whatwouldyoudotofixthemeasureofsatisfaction?

b. Whatcanyousayabouttheattributetypeoftheoriginalproductsatisfactionattribute?

4.Afewmonthslater,youareagainapproachedbythesamemarketingdirectorasinExercise3 .Thistime,hehasdevisedabetterapproachtomeasuretheextenttowhichacustomerprefersoneproductoverothersimilarproducts.Heexplains,“Whenwedevelopnewproducts,wetypicallycreateseveralvariationsandevaluatewhichonecustomersprefer.Ourstandardprocedureistogiveourtestsubjectsalloftheproductvariationsatonetimeandthenaskthemtoranktheproductvariationsinorderofpreference.However,ourtestsubjectsareveryindecisive,especiallywhenthereare

morethantwoproducts.Asaresult,testingtakesforever.Isuggestedthatweperformthecomparisonsinpairsandthenusethesecomparisonstogettherankings.Thus,ifwehavethreeproductvariations,wehavethecustomerscomparevariations1and2,then2and3,andfinally3and1.Ourtestingtimewithmynewprocedureisathirdofwhatitwasfortheoldprocedure,buttheemployeesconductingthetestscomplainthattheycannotcomeupwithaconsistentrankingfromtheresults.Andmybosswantsthelatestproductevaluations,yesterday.Ishouldalsomentionthathewasthepersonwhocameupwiththeoldproductevaluationapproach.Canyouhelpme?”

a. Isthemarketingdirectorintrouble?Willhisapproachworkforgeneratinganordinalrankingoftheproductvariationsintermsofcustomerpreference?Explain.

b. Isthereawaytofixthemarketingdirector’sapproach?Moregenerally,whatcanyousayabouttryingtocreateanordinalmeasurementscalebasedonpairwisecomparisons?

c. Fortheoriginalproductevaluationscheme,theoverallrankingsofeachproductvariationarefoundbycomputingitsaverageoveralltestsubjects.Commentonwhetheryouthinkthatthisisareasonableapproach.Whatotherapproachesmightyoutake?

5.Canyouthinkofasituationinwhichidentificationnumberswouldbeusefulforprediction?

6.Aneducationalpsychologistwantstouseassociationanalysistoanalyzetestresults.Thetestconsistsof100questionswithfourpossibleanswerseach.

a. Howwouldyouconvertthisdataintoaformsuitableforassociationanalysis?

b. Inparticular,whattypeofattributeswouldyouhaveandhowmanyofthemarethere?

7.Whichofthefollowingquantitiesislikelytoshowmoretemporalautocorrelation:dailyrainfallordailytemperature?Why?

8.Discusswhyadocument-termmatrixisanexampleofadatasetthathasasymmetricdiscreteorasymmetriccontinuousfeatures.

9.Manysciencesrelyonobservationinsteadof(orinadditionto)designedexperiments.Comparethedataqualityissuesinvolvedinobservationalsciencewiththoseofexperimentalscienceanddatamining.

10.Discussthedifferencebetweentheprecisionofameasurementandthetermssingleanddoubleprecision,astheyareusedincomputerscience,typicallytorepresentfloating-pointnumbersthatrequire32and64bits,respectively.

11.Giveatleasttwoadvantagestoworkingwithdatastoredintextfilesinsteadofinabinaryformat.

12.Distinguishbetweennoiseandoutliers.Besuretoconsiderthefollowingquestions.

a. Isnoiseeverinterestingordesirable?Outliers?

b. Cannoiseobjectsbeoutliers?

c. Arenoiseobjectsalwaysoutliers?

d. Areoutliersalwaysnoiseobjects?

e. Cannoisemakeatypicalvalueintoanunusualone,orviceversa?

Algorithm 2.3 Algorithm for finding k-nearest neighbors.
1: for i = 1 to number of data objects do
2:   Find the distances of the ith object to all other objects.
3:   Sort these distances in decreasing order. (Keep track of which object is associated with each distance.)
4:   return the objects associated with the first k distances of the sorted list
5: end for

13. Consider the problem of finding the K-nearest neighbors of a data object. A programmer designs Algorithm 2.3 for this task.

a. Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will return a distance of 0 only for objects that are the same.

b. How would you fix this problem?

14. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of proximity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.

15. You are given a set of m objects that is divided into k groups, where the ith group is of size mi. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)

a. We randomly select n × mi/m elements from each group.

b. We randomly select n elements from the data set, without regard for the group to which an object belongs.

16. Consider a document-term matrix, where tf_ij is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by

tf′_ij = tf_ij × log(m / df_i),   (2.31)

where df_i is the number of documents in which the ith term appears, which is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

a. What is the effect of this transformation if a term occurs in one document? In every document?

b. What might be the purpose of this transformation?

17. Assume that we apply a square root transformation to a ratio attribute x to obtain the new attribute x*. As part of your analysis, you identify an interval (a, b) in which x* has a linear relationship to another attribute y.

a. What is the corresponding interval (A, B) in terms of x?

b. Give an equation that relates y to x.

18. This exercise compares and contrasts some similarity and distance measures.

a. For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

x = 0101010001
y = 0100011000

b. Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain. (Note: The Hamming measure is a distance, while the other three measures are similarities, but don't let this confuse you.)

c. Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

d. If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share >99.9% of the same genes.)

19. For the following vectors, x and y, calculate the indicated similarity or distance measures.

a. x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, Euclidean

b. x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, Jaccard

c. x = (0, −1, 0, 1), y = (1, 0, −1, 0): cosine, correlation, Euclidean

d. x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1): cosine, correlation, Jaccard

e. x = (2, −1, 0, 2, 0, −3), y = (−1, 1, −1, 0, 0, −1): cosine, correlation

20. Here, we further explore the cosine and correlation measures.

a. What is the range of values possible for the cosine measure?

b. If two objects have a cosine measure of 1, are they identical? Explain.

c. What is the relationship of the cosine measure to correlation, if any? (Hint: Look at statistical measures such as mean and standard deviation in cases where cosine and correlation are the same and different.)

d. Figure 2.22(a) shows the relationship of the cosine measure to Euclidean distance for 100,000 randomly generated points that have been normalized to have an L2 length of 1. What general observation can you make about the relationship between Euclidean distance and cosine similarity when vectors have an L2 norm of 1?

e. Figure 2.22(b) shows the relationship of correlation to Euclidean distance for 100,000 randomly generated points that have been standardized to have a mean of 0 and a standard deviation of 1. What general observation can you make about the relationship between Euclidean distance and correlation when the vectors have been standardized to have a mean of 0 and a standard deviation of 1?

f. Derive the mathematical relationship between cosine similarity and Euclidean distance when each data object has an L2 length of 1.

g. Derive the mathematical relationship between correlation and Euclidean distance when each data point has been standardized by subtracting its mean and dividing by its standard deviation.

Figure 2.22. Graphs for Exercise 20.

21. Show that the set difference metric given by

d(A, B) = size(A − B) + size(B − A)   (2.32)

satisfies the metric axioms given on page 77. A and B are sets and A − B is the set difference.

22. Discuss how you might map correlation values from the interval [−1, 1] to the interval [0, 1]. Note that the type of transformation that you use might depend on the application that you have in mind. Thus, consider two applications: clustering time series and predicting the behavior of one time series given another.

23. Given a similarity measure with values in the interval [0, 1], describe two ways to transform this similarity value into a dissimilarity value in the interval [0, ∞].

24. Proximity is typically defined between a pair of objects.

a. Define two ways in which you might define the proximity among a group of objects.

b. How might you define the distance between two sets of points in Euclidean space?

c. How might you define the proximity between two sets of data objects? (Make no assumption about the data objects, except that a proximity measure is defined between any pair of objects.)

25. You are given a set of points S in Euclidean space, as well as the distance of each point in S to a point x. (It does not matter if x ∈ S.)

a. If the goal is to find all points within a specified distance ε of point y, y ≠ x, explain how you could use the triangle inequality and the already calculated distances to x to potentially reduce the number of distance calculations necessary. Hint: The triangle inequality, d(x, z) ≤ d(x, y) + d(y, z), can be rewritten as d(x, y) ≥ d(x, z) − d(y, z).

b. In general, how would the distance between x and y affect the number of distance calculations?

c. Suppose that you can find a small subset of points S′ from the original data set, such that every point in the data set is within a specified distance ε of at least one of the points in S′, and that you also have the pairwise distance matrix for S′. Describe a technique that uses this information to compute, with a minimum of distance calculations, the set of all points within a distance of β of a specified point from the data set.

26. Show that 1 minus the Jaccard similarity is a distance measure between two data objects, x and y, that satisfies the metric axioms given on page 77. Specifically, d(x, y) = 1 − J(x, y).

27. Show that the distance measure defined as the angle between two data vectors, x and y, satisfies the metric axioms given on page 77. Specifically, d(x, y) = arccos(cos(x, y)).

28. Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.

3Classification:BasicConceptsandTechniques

Humanshaveaninnateabilitytoclassifythingsintocategories,e.g.,mundanetaskssuchasfilteringspamemailmessagesormorespecializedtaskssuchasrecognizingcelestialobjectsintelescopeimages(seeFigure3.1 ).Whilemanualclassificationoftensufficesforsmallandsimpledatasetswithonlyafewattributes,largerandmorecomplexdatasetsrequireanautomatedsolution.

Figure3.1.ClassificationofgalaxiesfromtelescopeimagestakenfromtheNASAwebsite.

Thischapterintroducesthebasicconceptsofclassificationanddescribessomeofitskeyissuessuchasmodeloverfitting,modelselection,andmodelevaluation.Whilethesetopicsareillustratedusingaclassificationtechniqueknownasdecisiontreeinduction,mostofthediscussioninthischapterisalsoapplicabletootherclassificationtechniques,manyofwhicharecoveredinChapter4 .

3.1BasicConceptsFigure3.2 illustratesthegeneralideabehindclassification.Thedataforaclassificationtaskconsistsofacollectionofinstances(records).Eachsuchinstanceischaracterizedbythetuple( ,y),where isthesetofattributevaluesthatdescribetheinstanceandyistheclasslabeloftheinstance.Theattributeset cancontainattributesofanytype,whiletheclasslabelymustbecategorical.

Figure3.2.Aschematicillustrationofaclassificationtask.

Aclassificationmodelisanabstractrepresentationoftherelationshipbetweentheattributesetandtheclasslabel.Aswillbeseeninthenexttwochapters,themodelcanberepresentedinmanyways,e.g.,asatree,aprobabilitytable,orsimply,avectorofreal-valuedparameters.Moreformally,wecanexpressitmathematicallyasatargetfunctionfthattakesasinputtheattributeset andproducesanoutputcorrespondingtothepredictedclasslabel.Themodelissaidtoclassifyaninstance( ,y)correctlyif .

Table3.1 showsexamplesofattributesetsandclasslabelsforvariousclassificationtasks.Spamfilteringandtumoridentificationareexamplesofbinaryclassificationproblems,inwhicheachdatainstancecanbecategorizedintooneoftwoclasses.Ifthenumberofclassesislargerthan2,asinthe

f(x)=y

galaxyclassificationexample,thenitiscalledamulticlassclassificationproblem.

Table 3.1. Examples of classification tasks.

Task                    Attribute set                                                     Class label
Spam filtering          Features extracted from email message header and content         spam or non-spam
Tumor identification    Features extracted from magnetic resonance imaging (MRI) scans   malignant or benign
Galaxy classification   Features extracted from telescope images                         elliptical, spiral, or irregular-shaped

Weillustratethebasicconceptsofclassificationinthischapterwiththefollowingtwoexamples.

3.1.ExampleVertebrateClassificationTable3.2 showsasampledatasetforclassifyingvertebratesintomammals,reptiles,birds,fishes,andamphibians.Theattributesetincludescharacteristicsofthevertebratesuchasitsbodytemperature,skincover,andabilitytofly.Thedatasetcanalsobeusedforabinaryclassificationtasksuchasmammalclassification,bygroupingthereptiles,birds,fishes,andamphibiansintoasinglecategorycallednon-mammals.

Table 3.2. A sample data for the vertebrate classification problem.

Vertebrate Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
human | warm-blooded | hair | yes | no | no | yes | no | mammal
python | cold-blooded | scales | no | no | no | no | yes | reptile
salmon | cold-blooded | scales | no | yes | no | no | no | fish
whale | warm-blooded | hair | yes | yes | no | no | no | mammal
frog | cold-blooded | none | no | semi | no | yes | yes | amphibian
komodo dragon | cold-blooded | scales | no | no | no | yes | no | reptile
bat | warm-blooded | hair | yes | no | yes | yes | yes | mammal
pigeon | warm-blooded | feathers | no | no | yes | yes | no | bird
cat | warm-blooded | fur | yes | no | no | yes | no | mammal
leopard shark | cold-blooded | scales | yes | yes | no | no | no | fish
turtle | cold-blooded | scales | no | semi | no | yes | no | reptile
penguin | warm-blooded | feathers | no | semi | no | yes | no | bird
porcupine | warm-blooded | quills | yes | no | no | yes | yes | mammal
eel | cold-blooded | scales | no | yes | no | no | no | fish
salamander | cold-blooded | none | no | semi | no | yes | yes | amphibian

Example 3.2. Loan Borrower Classification
Consider the problem of predicting whether a loan borrower will repay the loan or default on the loan payments. The data set used to build the classification model is shown in Table 3.3. The attribute set includes personal information of the borrower such as marital status and annual income, while the class label indicates whether the borrower had defaulted on the loan payments.

Table 3.3. A sample data for the loan borrower classification problem.

ID | Home Owner | Marital Status | Annual Income | Defaulted?
1 | Yes | Single | 125000 | No
2 | No | Married | 100000 | No
3 | No | Single | 70000 | No
4 | Yes | Married | 120000 | No
5 | No | Divorced | 95000 | Yes
6 | No | Single | 60000 | No
7 | Yes | Divorced | 220000 | No
8 | No | Single | 85000 | Yes
9 | No | Married | 75000 | No
10 | No | Single | 90000 | Yes

A classification model serves two important roles in data mining. First, it is used as a predictive model to classify previously unlabeled instances. A good classification model must provide accurate predictions with a fast response time. Second, it serves as a descriptive model to identify the characteristics that distinguish instances from different classes. This is particularly useful for critical applications, such as medical diagnosis, where it is insufficient to have a model that makes a prediction without justifying how it reaches such a decision.

For example, a classification model induced from the vertebrate data set shown in Table 3.2 can be used to predict the class label of the following vertebrate:

Vertebrate Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
gila monster | cold-blooded | scales | no | no | no | yes | yes | ?

In addition, it can be used as a descriptive model to help determine characteristics that define a vertebrate as a mammal, a reptile, a bird, a fish, or an amphibian. For example, the model may identify mammals as warm-blooded vertebrates that give birth to their young.

There are several points worth noting regarding the previous example. First, although all the attributes shown in Table 3.2 are qualitative, there are no restrictions on the type of attributes that can be used as predictor variables. The class label, on the other hand, must be of nominal type. This distinguishes classification from other predictive modeling tasks such as regression, where the predicted value is often quantitative. More information about regression can be found in Appendix D.

Another point worth noting is that not all attributes may be relevant to the classification task. For example, the average length or weight of a vertebrate may not be useful for classifying mammals, as these attributes can show the same value for both mammals and non-mammals. Such an attribute is typically discarded during preprocessing. The remaining attributes might not be able to distinguish the classes by themselves, and thus, must be used in concert with other attributes. For instance, the Body Temperature attribute is insufficient to distinguish mammals from other vertebrates. When it is used together with Gives Birth, the classification of mammals improves significantly. However, when additional attributes, such as Skin Cover, are included, the model becomes overly specific and no longer covers all mammals. Finding the optimal combination of attributes that best discriminates instances from different classes is the key challenge in building classification models.

3.2 General Framework for Classification
Classification is the task of assigning labels to unlabeled data instances, and a classifier is used to perform such a task. A classifier is typically described in terms of a model, as illustrated in the previous section. The model is created using a given set of instances, known as the training set, which contains attribute values as well as class labels for each instance. The systematic approach for learning a classification model given a training set is known as a learning algorithm. The process of using a learning algorithm to build a classification model from the training data is known as induction. This process is also often described as "learning a model" or "building a model." The process of applying a classification model to unseen test instances to predict their class labels is known as deduction. Thus, the process of classification involves two steps: applying a learning algorithm to training data to learn a model, and then applying the model to assign labels to unlabeled instances. Figure 3.3 illustrates the general framework for classification.

Figure 3.3. General framework for building a classification model.
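To make these two steps concrete, here is a minimal Python sketch (assuming the scikit-learn library, which is not part of the text; the integer encoding of the categorical attributes and the unlabeled test instance are invented for illustration) that induces a decision tree from the loan borrower data of Table 3.3 and then deduces the label of a new instance:

    # Induction/deduction sketch using scikit-learn (an assumption, not from the text).
    from sklearn.tree import DecisionTreeClassifier

    # Attribute set x = (Home Owner, Marital Status, Annual Income) from Table 3.3.
    # Categorical values are integer-coded for brevity (Home Owner: yes=1/no=0;
    # Marital Status: single=0, married=1, divorced=2); one-hot encoding would be
    # more faithful for a nominal attribute.
    X_train = [[1, 0, 125000], [0, 1, 100000], [0, 0,  70000], [1, 1, 120000],
               [0, 2,  95000], [0, 0,  60000], [1, 2, 220000], [0, 0,  85000],
               [0, 1,  75000], [0, 0,  90000]]
    y_train = ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']

    # Induction: a learning algorithm builds a classification model from the training set.
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # Deduction: the model assigns a label to a previously unseen instance
    # (here, a hypothetical married non-owner with an annual income of 80,000).
    print(model.predict([[0, 1, 80000]]))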

A classification technique refers to a general approach to classification, e.g., the decision tree technique that we will study in this chapter. This classification technique, like most others, consists of a family of related models and a number of algorithms for learning these models. In Chapter 4, we will study additional classification techniques, including neural networks and support vector machines.

A couple of notes on terminology. First, the terms "classifier" and "model" are often taken to be synonymous. If a classification technique builds a single, global model, then this is fine. However, while every model defines a classifier, not every classifier is defined by a single model. Some classifiers, such as k-nearest neighbor classifiers, do not build an explicit model (Section 4.3), while other classifiers, such as ensemble classifiers, combine the output of a collection of models (Section 4.10). Second, the term "classifier" is often used in a more general sense to refer to a classification technique. Thus, for example, "decision tree classifier" can refer to the decision tree classification technique or a specific classifier built using that technique. Fortunately, the meaning of "classifier" is usually clear from the context.

In the general framework shown in Figure 3.3, the induction and deduction steps should be performed separately. In fact, as will be discussed later in Section 3.6, the training and test sets should be independent of each other to ensure that the induced model can accurately predict the class labels of instances it has never encountered before. Models that deliver such predictive insights are said to have good generalization performance. The performance of a model (classifier) can be evaluated by comparing the predicted labels against the true labels of instances. This information can be summarized in a table called a confusion matrix. Table 3.4 depicts the confusion matrix for a binary classification problem. Each entry f_{ij} denotes the number of instances from class i predicted to be of class j. For example, f_{01} is the number of instances from class 0 incorrectly predicted as class 1. The number of correct predictions made by the model is (f_{11} + f_{00}) and the number of incorrect predictions is (f_{10} + f_{01}).

Table 3.4. Confusion matrix for a binary classification problem.

                  | Predicted Class = 1 | Predicted Class = 0
Actual Class = 1  | f_{11}              | f_{10}
Actual Class = 0  | f_{01}              | f_{00}

Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information into a single number makes it more convenient to compare the relative performance of different models. This can be done using an evaluation metric such as accuracy, which is computed in the following way:

Accuracy = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}.    (3.1)

For binary classification problems, the accuracy of a model is given by

Accuracy = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}.    (3.2)

Error rate is another related metric, which is defined as follows for binary classification problems:

Error rate = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}.    (3.3)

The learning algorithms of most classification techniques are designed to learn models that attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set. We will revisit the topic of model evaluation in Section 3.6.
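As a small illustration (not from the text), the following Python snippet fills the confusion matrix of Table 3.4 for a hypothetical set of true and predicted labels and evaluates Equations (3.2) and (3.3):

    # Build the binary confusion matrix and compute accuracy and error rate.
    def confusion_matrix(y_true, y_pred):
        # f[(i, j)] = number of instances from class i predicted as class j
        f = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
        for t, p in zip(y_true, y_pred):
            f[(t, p)] += 1
        return f

    y_true = [1, 1, 0, 0, 1, 0, 0, 1]      # hypothetical true labels
    y_pred = [1, 0, 0, 0, 1, 1, 0, 1]      # hypothetical predicted labels
    f = confusion_matrix(y_true, y_pred)

    total = sum(f.values())
    accuracy = (f[(1, 1)] + f[(0, 0)]) / total       # Equation (3.2)
    error_rate = (f[(1, 0)] + f[(0, 1)]) / total     # Equation (3.3)
    print(accuracy, error_rate)                      # 0.75 0.25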

3.3 Decision Tree Classifier
This section introduces a simple classification technique known as the decision tree classifier. To illustrate how a decision tree works, consider the classification problem of distinguishing mammals from non-mammals using the vertebrate data set shown in Table 3.2. Suppose a new species is discovered by scientists. How can we tell whether it is a mammal or a non-mammal? One approach is to pose a series of questions about the characteristics of the species. The first question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then it is definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we need to ask a follow-up question: Do the females of the species give birth to their young? Those that do give birth are definitely mammals, while those that do not are likely to be non-mammals (with the exception of egg-laying mammals such as the platypus and spiny anteater).

The previous example illustrates how we can solve a classification problem by asking a series of carefully crafted questions about the attributes of the test instance. Each time we receive an answer, we could ask a follow-up question until we can conclusively decide on its class label. The series of questions and their possible answers can be organized into a hierarchical structure called a decision tree. Figure 3.4 shows an example of the decision tree for the mammal classification problem. The tree has three types of nodes:

A root node, with no incoming links and zero or more outgoing links.
Internal nodes, each of which has exactly one incoming link and two or more outgoing links.
Leaf or terminal nodes, each of which has exactly one incoming link and no outgoing links.

Every leaf node in the decision tree is associated with a class label. The non-terminal nodes, which include the root and internal nodes, contain attribute test conditions that are typically defined using a single attribute. Each possible outcome of the attribute test condition is associated with exactly one child of this node. For example, the root node of the tree shown in Figure 3.4 uses the attribute Body Temperature to define an attribute test condition that has two outcomes, warm and cold, resulting in two child nodes.

Figure 3.4. A decision tree for the mammal classification problem.

Given a decision tree, classifying a test instance is straightforward. Starting from the root node, we apply its attribute test condition and follow the appropriate branch based on the outcome of the test. This will lead us either to another internal node, for which a new attribute test condition is applied, or to a leaf node. Once a leaf node is reached, we assign the class label associated with the node to the test instance. As an illustration, Figure 3.5 traces the path used to predict the class label of a flamingo. The path terminates at a leaf node labeled as Non-mammals.

Figure 3.5. Classifying an unlabeled vertebrate. The dashed lines represent the outcomes of applying various attribute test conditions on the unlabeled vertebrate. The vertebrate is eventually assigned to the Non-mammals class.

3.3.1 A Basic Algorithm to Build a Decision Tree

Many possible decision trees can be constructed from a particular data set. While some trees are better than others, finding an optimal one is computationally expensive due to the exponential size of the search space. Efficient algorithms have been developed to induce a reasonably accurate, albeit suboptimal, decision tree in a reasonable amount of time. These algorithms usually employ a greedy strategy to grow the decision tree in a top-down fashion by making a series of locally optimal decisions about which attribute to use when partitioning the training data. One of the earliest methods is Hunt's algorithm, which is the basis for many current implementations of decision tree classifiers, including ID3, C4.5, and CART. This subsection presents Hunt's algorithm and describes some of the design issues that must be considered when building a decision tree.

Hunt's Algorithm
In Hunt's algorithm, a decision tree is grown in a recursive fashion. The tree initially contains a single root node that is associated with all the training instances. If a node is associated with instances from more than one class, it is expanded using an attribute test condition that is determined using a splitting criterion. A child leaf node is created for each outcome of the attribute test condition and the instances associated with the parent node are distributed to the children based on the test outcomes. This node expansion step can then be recursively applied to each child node, as long as it has labels of more than one class. If all the instances associated with a leaf node have identical class labels, then the node is not expanded any further. Each leaf node is assigned a class label that occurs most frequently in the training instances associated with the node.

To illustrate how the algorithm works, consider the training set shown in Table 3.3 for the loan borrower classification problem. Suppose we apply Hunt's algorithm to fit the training data. The tree initially contains only a single leaf node, as shown in Figure 3.6(a). This node is labeled as Defaulted = No, since the majority of the borrowers did not default on their loan payments. The training error of this tree is 30% as three out of the ten training instances have the class label Defaulted = Yes. The leaf node can therefore be further expanded because it contains training instances from more than one class.

Figure 3.6. Hunt's algorithm for building decision trees.

Let Home Owner be the attribute chosen to split the training instances. The justification for choosing this attribute as the attribute test condition will be discussed later. The resulting binary split on the Home Owner attribute is shown in Figure 3.6(b). All the training instances for which Home Owner = Yes are propagated to the left child of the root node and the rest are propagated to the right child. Hunt's algorithm is then recursively applied to each child. The left child becomes a leaf node labeled Defaulted = No, since all instances associated with this node have the identical class label Defaulted = No. The right child has instances from each class label. Hence, we split it further. The resulting subtrees after recursively expanding the right child are shown in Figures 3.6(c) and (d).

Hunt's algorithm, as described above, makes some simplifying assumptions that are often not true in practice. In the following, we describe these assumptions and briefly discuss some of the possible ways for handling them.

1. Some of the child nodes created in Hunt's algorithm can be empty if none of the training instances have the particular attribute values. One way to handle this is by declaring each of them as a leaf node with a class label that occurs most frequently among the training instances associated with their parent nodes.

2. If all training instances associated with a node have identical attribute values but different class labels, it is not possible to expand this node any further. One way to handle this case is to declare it a leaf node and assign it the class label that occurs most frequently in the training instances associated with this node.

Design Issues of Decision Tree Induction
Hunt's algorithm is a generic procedure for growing decision trees in a greedy fashion. To implement the algorithm, there are two key design issues that must be addressed.

1. What is the splitting criterion? At each recursive step, an attribute must be selected to partition the training instances associated with a node into smaller subsets associated with its child nodes. The splitting criterion determines which attribute is chosen as the test condition and how the training instances should be distributed to the child nodes. This will be discussed in Sections 3.3.2 and 3.3.3.

2. What is the stopping criterion? The basic algorithm stops expanding a node only when all the training instances associated with the node have the same class labels or have identical attribute values. Although these conditions are sufficient, there are reasons to stop expanding a node much earlier even if the leaf node contains training instances from more than one class. This process is called early termination and the condition used to determine when a node should be stopped from expanding is called a stopping criterion. The advantages of early termination are discussed in Section 3.4.

3.3.2 Methods for Expressing Attribute Test Conditions

Decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types.

Binary Attributes
The test condition for a binary attribute generates two potential outcomes, as shown in Figure 3.7.

Figure 3.7. Attribute test condition for a binary attribute.

Nominal Attributes
Since a nominal attribute can have many values, its attribute test condition can be expressed in two ways, as a multiway split or a binary split, as shown in Figure 3.8. For a multiway split (Figure 3.8(a)), the number of outcomes depends on the number of distinct values for the corresponding attribute. For example, if an attribute such as marital status has three distinct values (single, married, or divorced), its test condition will produce a three-way split. It is also possible to create a binary split by partitioning all values taken by the nominal attribute into two groups. For example, some decision tree algorithms, such as CART, produce only binary splits by considering all 2^{k-1} - 1 ways of creating a binary partition of k attribute values. Figure 3.8(b) illustrates three different ways of grouping the attribute values for marital status into two subsets.

Figure 3.8. Attribute test conditions for nominal attributes.

Ordinal Attributes
Ordinal attributes can also produce binary or multiway splits. Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values. Figure 3.9 illustrates various ways of splitting training records based on the Shirt Size attribute. The groupings shown in Figures 3.9(a) and (b) preserve the order among the attribute values, whereas the grouping shown in Figure 3.9(c) violates this property because it combines the attribute values Small and Large into the same partition while Medium and Extra Large are combined into another partition.

Figure 3.9. Different ways of grouping ordinal attribute values.

Continuous Attributes
For continuous attributes, the attribute test condition can be expressed as a comparison test (e.g., A < v) producing a binary split, or as a range query of the form v_i ≤ A < v_{i+1}, for i = 1, …, k, producing a multiway split. The difference between these approaches is shown in Figure 3.10. For the binary split, any possible value v between the minimum and maximum attribute values in the training data can be used for constructing the comparison test A < v. However, it is sufficient to only consider distinct attribute values in the training set as candidate split positions. For the multiway split, any possible collection of attribute value ranges can be used, as long as they are mutually exclusive and cover the entire range of attribute values between the minimum and maximum values observed in the training set. One approach for constructing multiway splits is to apply the discretization strategies described in Section 2.3.6 on page 63. After discretization, a new ordinal value is assigned to each discretized interval, and the attribute test condition is then defined using this newly constructed ordinal attribute.

Figure 3.10. Test condition for continuous attributes.

3.3.3 Measures for Selecting an Attribute Test Condition

There are many measures that can be used to determine the goodness of an attribute test condition. These measures try to give preference to attribute test conditions that partition the training instances into purer subsets in the child nodes, which mostly have the same class labels. Having purer nodes is useful since a node that has all of its training instances from the same class does not need to be expanded further. In contrast, an impure node containing training instances from multiple classes is likely to require several levels of node expansions, thereby increasing the depth of the tree considerably. Larger trees are less desirable as they are more susceptible to model overfitting, a condition that may degrade the classification performance on unseen instances, as will be discussed in Section 3.4. They are also difficult to interpret and incur more training and test time as compared to smaller trees.

In the following, we present different ways of measuring the impurity of a node and the collective impurity of its child nodes, both of which will be used to identify the best attribute test condition for a node.

Impurity Measure for a Single Node
The impurity of a node measures how dissimilar the class labels are for the data instances belonging to a common node. Following are examples of measures that can be used to evaluate the impurity of a node t:

Entropy = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t),    (3.4)

Gini index = 1 - \sum_{i=0}^{c-1} p_i(t)^2,    (3.5)

Classification error = 1 - \max_i [p_i(t)],    (3.6)

where p_i(t) is the relative frequency of training instances that belong to class i at node t, c is the total number of classes, and 0 \log_2 0 = 0 in entropy calculations. All three measures give a zero impurity value if a node contains instances from a single class and maximum impurity if the node has an equal proportion of instances from multiple classes.

Figure 3.11 compares the relative magnitude of the impurity measures when applied to binary classification problems. Since there are only two classes, p_0(t) + p_1(t) = 1. The horizontal axis refers to the fraction of instances that belong to one of the two classes. Observe that all three measures attain their maximum value when the class distribution is uniform (i.e., p_0(t) = p_1(t) = 0.5) and minimum value when all the instances belong to a single class (i.e., either p_0(t) or p_1(t) equals 1). The following examples illustrate how the values of the impurity measures vary as we alter the class distribution.

Figure 3.11. Comparison among the impurity measures for binary classification problems.

Node N1 (Class = 0 count: 0, Class = 1 count: 6):
Gini = 1 - (0/6)^2 - (6/6)^2 = 0
Entropy = -(0/6)\log_2(0/6) - (6/6)\log_2(6/6) = 0
Error = 1 - \max[0/6, 6/6] = 0

Node N2 (Class = 0 count: 1, Class = 1 count: 5):
Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
Entropy = -(1/6)\log_2(1/6) - (5/6)\log_2(5/6) = 0.650
Error = 1 - \max[1/6, 5/6] = 0.167

Node N3 (Class = 0 count: 3, Class = 1 count: 3):
Gini = 1 - (3/6)^2 - (3/6)^2 = 0.5
Entropy = -(3/6)\log_2(3/6) - (3/6)\log_2(3/6) = 1
Error = 1 - \max[3/6, 3/6] = 0.5

Based on these calculations, node N1 has the lowest impurity value, followed by N2 and N3. This example, along with Figure 3.11, shows the consistency among the impurity measures, i.e., if a node N1 has a lower entropy than node N2, then the Gini index and error rate of N1 will also be lower than that of N2. Despite their agreement, the attribute chosen as splitting criterion by the impurity measures can still be different (see Exercise 6 on page 187).
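The following Python sketch (illustrative only, not from the text) implements the three impurity measures of Equations (3.4)-(3.6) and reproduces the values computed above for nodes N1, N2, and N3:

    # Impurity measures for a single node, given its class counts.
    from math import log2

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * log2(c / n) for c in counts if c > 0)   # 0 log2 0 = 0

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def classification_error(counts):
        return 1.0 - max(counts) / sum(counts)

    for name, counts in [('N1', [0, 6]), ('N2', [1, 5]), ('N3', [3, 3])]:
        print(name, round(gini(counts), 3), round(entropy(counts), 3),
              round(classification_error(counts), 3))
    # N1: 0.0 0.0 0.0   N2: 0.278 0.65 0.167   N3: 0.5 1.0 0.5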

Collective Impurity of Child Nodes
Consider an attribute test condition that splits a node containing N training instances into k children, {v_1, v_2, ⋯, v_k}, where every child node represents a partition of the data resulting from one of the k outcomes of the attribute test condition. Let N(v_j) be the number of training instances associated with a child node v_j, whose impurity value is I(v_j). Since a training instance in the parent node reaches node v_j for a fraction of N(v_j)/N times, the collective impurity of the child nodes can be computed by taking a weighted sum of the impurities of the child nodes, as follows:

I(children) = \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j),    (3.7)

Example 3.3. Weighted Entropy
Consider the candidate attribute test conditions shown in Figures 3.12(a) and (b) for the loan borrower classification problem. Splitting on the Home Owner attribute will generate two child nodes

Figure 3.12. Examples of candidate attribute test conditions.

whose weighted entropy can be calculated as follows:

I(Home Owner = yes) = -\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3} = 0
I(Home Owner = no) = -\frac{3}{7}\log_2\frac{3}{7} - \frac{4}{7}\log_2\frac{4}{7} = 0.985
I(Home Owner) = \frac{3}{10} \times 0 + \frac{7}{10} \times 0.985 = 0.690

Splitting on Marital Status, on the other hand, leads to three child nodes with a weighted entropy given by

I(Marital Status = Single) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971
I(Marital Status = Married) = -\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3} = 0
I(Marital Status = Divorced) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1.000
I(Marital Status) = \frac{5}{10} \times 0.971 + \frac{3}{10} \times 0 + \frac{2}{10} \times 1 = 0.686

Thus, Marital Status has a lower weighted entropy than Home Owner.

Identifying the best attribute test condition
To determine the goodness of an attribute test condition, we need to compare the degree of impurity of the parent node (before splitting) with the weighted degree of impurity of the child nodes (after splitting). The larger their difference, the better the test condition. This difference, Δ, also termed as the gain in purity of an attribute test condition, can be defined as follows:

Δ = I(parent) - I(children),    (3.8)

where I(parent) is the impurity of a node before splitting and I(children) is the weighted impurity measure after splitting. It can be shown that the gain is non-negative, since I(parent) ≥ I(children) for any reasonable measure such as those presented above. The higher the gain, the purer are the classes in the child nodes relative to the parent node. The splitting criterion in the decision tree learning algorithm selects the attribute test condition that shows the maximum gain. Note that maximizing the gain at a given node is equivalent to minimizing the weighted impurity measure of its children since I(parent) is the same for all candidate attribute test conditions. Finally, when entropy is used as the impurity measure, the difference in entropy is commonly known as information gain, \Delta_{info}.

Figure 3.13. Splitting criteria for the loan borrower classification problem using Gini index.
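A short Python sketch (illustrative, not from the text) that reproduces the weighted entropies of Example 3.3 and the corresponding gains in purity of Equation (3.8), using the class counts of the Home Owner and Marital Status splits on Table 3.3:

    # Weighted impurity of child nodes (Equation 3.7) and gain in purity (Equation 3.8).
    from math import log2

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * log2(c / n) for c in counts if c > 0)

    def weighted_impurity(partitions, impurity=entropy):
        n = sum(sum(p) for p in partitions)
        return sum(sum(p) / n * impurity(p) for p in partitions)      # Equation (3.7)

    parent = [3, 7]                                   # 3 defaulted, 7 repaid
    home_owner = [[0, 3], [3, 4]]                     # yes: (0, 3), no: (3, 4)
    marital_status = [[2, 3], [0, 3], [1, 1]]         # single, married, divorced

    for name, split in [('Home Owner', home_owner), ('Marital Status', marital_status)]:
        i_children = weighted_impurity(split)
        gain = entropy(parent) - i_children           # information gain, Delta_info
        print(name, round(i_children, 3), round(gain, 3))
    # Marital Status yields the lower weighted entropy and hence the larger gain,
    # matching Example 3.3 (values agree with the text up to rounding).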

In the following, we present illustrative approaches for identifying the best attribute test condition given qualitative or quantitative attributes.

Splitting of Qualitative Attributes
Consider the first two candidate splits shown in Figure 3.12 involving the qualitative attributes Home Owner and Marital Status. The initial class distribution at the parent node is (0.3, 0.7), since there are 3 instances of class Yes and 7 instances of class No in the training data. Thus,

Entropy(parent) = -\frac{3}{10}\log_2\frac{3}{10} - \frac{7}{10}\log_2\frac{7}{10} = 0.881.

The information gains for Home Owner and Marital Status are each given by

\Delta_{info}(Home Owner) = 0.881 - 0.690 = 0.191
\Delta_{info}(Marital Status) = 0.881 - 0.686 = 0.195

The information gain for Marital Status is thus higher due to its lower weighted entropy, and it will thus be considered for splitting.

Binary Splitting of Qualitative Attributes
Consider building a decision tree using only binary splits and the Gini index as the impurity measure. Figure 3.13 shows examples of four candidate splitting criteria for the Home Owner and Marital Status attributes. Since there are 3 borrowers in the training set who defaulted and 7 others who repaid their loan (see the table in Figure 3.13), the Gini index of the parent node before splitting is

1 - \left(\frac{3}{10}\right)^2 - \left(\frac{7}{10}\right)^2 = 0.420.

If Home Owner is chosen as the splitting attribute, the Gini index for the child nodes N1 and N2 are 0 and 0.490, respectively. The weighted average Gini index for the children is

\frac{3}{10} \times 0 + \frac{7}{10} \times 0.490 = 0.343,

where the weights represent the proportion of training instances assigned to each child. The gain using Home Owner as splitting attribute is 0.420 - 0.343 = 0.077. Similarly, we can apply a binary split on the Marital Status attribute. However, since Marital Status is a nominal attribute with three outcomes, there are three possible ways to group the attribute values into a binary split. The weighted average Gini index of the children for each candidate binary split is shown in Figure 3.13. Based on these results, Home Owner and the last binary split using Marital Status are clearly the best candidates, since they both produce the lowest weighted average Gini index. Binary splits can also be used for ordinal attributes, if the binary partitioning of the attribute values does not violate the ordering property of the values.

Binary Splitting of Quantitative Attributes
Consider the problem of identifying the best binary split Annual Income ≤ τ for the preceding loan approval classification problem. As discussed previously, even though τ can take any value between the minimum and maximum values of annual income in the training set, it is sufficient to only consider the annual income values observed in the training set as candidate split positions. For each candidate τ, the training set is scanned once to count the number of borrowers with annual income less than or greater than τ, along with their class proportions. We can then compute the Gini index at each candidate split position and choose the τ that produces the lowest value. Computing the Gini index at each candidate split position requires O(N) operations, where N is the number of training instances. Since there are at most N possible candidates, the overall complexity of this brute-force method is O(N^2). It is possible to reduce the complexity of this problem to O(N log N) by using a method described as follows (see the illustration in Figure 3.14). In this method, we first sort the training instances based on their annual income, a one-time cost that requires O(N log N) operations. The candidate split positions are given by the midpoints between every two adjacent sorted values: $55,000, $65,000, $72,500, and so on. For the first candidate, since none of the instances has an annual income less than or equal to $55,000, the Gini index for the child node with Annual Income ≤ $55,000 is equal to zero. In contrast, there are 3 training instances of class Yes and 7 instances of class No with annual income greater than $55,000. The Gini index for this node is 0.420. The weighted average Gini index for the first candidate split position, τ = $55,000, is equal to 0 × 0 + 1 × 0.420 = 0.420.

Figure 3.14. Splitting continuous attributes.

For the next candidate, τ = $65,000, the class distribution of its child nodes can be obtained with a simple update of the distribution for the previous candidate. This is because, as τ increases from $55,000 to $65,000, there is only one training instance affected by the change. By examining the class label of the affected training instance, the new class distribution is obtained. For example, as τ increases to $65,000, there is only one borrower in the training set, with an annual income of $60,000, affected by this change. Since the class label for the borrower is No, the count for class No increases from 0 to 1 (for Annual Income ≤ $65,000) and decreases from 7 to 6 (for Annual Income > $65,000), as shown in Figure 3.14. The distribution for the Yes class remains unaffected. The updated Gini index for this candidate split position is 0.400.

This procedure is repeated until the Gini index for all candidates are found. The best split position corresponds to the one that produces the lowest Gini index, which occurs at τ = $97,500. Since the Gini index at each candidate split position can be computed in O(1) time, the complexity of finding the best split position is O(N) once all the values are kept sorted, a one-time operation that takes O(N log N) time. The overall complexity of this method is thus O(N log N), which is much smaller than the O(N^2) time taken by the brute-force method. The amount of computation can be further reduced by considering only candidate split positions located between two adjacent sorted instances with different class labels. For example, we do not need to consider candidate split positions located between $60,000 and $75,000 because all three instances with annual income in this range ($60,000, $70,000, and $75,000) have the same class labels. Choosing a split position within this range only increases the degree of impurity, compared to a split position located outside this range. Therefore, the candidate split positions at τ = $65,000 and τ = $72,500 can be ignored. Similarly, we do not need to consider the candidate split positions at $87,500, $92,500, $110,000, $122,500, and $172,500 because they are located between two adjacent instances with the same labels. This strategy reduces the number of candidate split positions to consider from 9 to 2 (excluding the two boundary cases τ = $55,000 and τ = $230,000).
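The sorted-scan procedure can be sketched in a few lines of Python (an illustration under the assumptions above, not the book's implementation); after a single sort, each candidate midpoint is evaluated with an O(1) update of the class counts:

    # Find the best Annual Income <= tau split for the loan data using the Gini index.
    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    # (annual income, defaulted?) pairs from Table 3.3
    data = [(125000, 'No'), (100000, 'No'), (70000, 'No'), (120000, 'No'),
            (95000, 'Yes'), (60000, 'No'), (220000, 'No'), (85000, 'Yes'),
            (75000, 'No'), (90000, 'Yes')]
    data.sort()                                        # O(N log N) one-time sort

    left = {'Yes': 0, 'No': 0}                         # counts for Annual Income <= tau
    right = {'Yes': 3, 'No': 7}                        # counts for Annual Income > tau
    n = len(data)
    best = (float('inf'), None)
    for i in range(n - 1):                             # one O(N) scan over the candidates
        label = data[i][1]
        left[label] += 1                               # O(1) update of the class counts
        right[label] -= 1
        tau = (data[i][0] + data[i + 1][0]) / 2        # midpoint split position
        w = (i + 1) / n
        score = w * gini(list(left.values())) + (1 - w) * gini(list(right.values()))
        best = min(best, (score, tau))
    print(best)   # lowest weighted Gini (0.3) occurs at tau = 97500, as in the text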

Gain Ratio
One potential limitation of impurity measures such as entropy and Gini index is that they tend to favor qualitative attributes with large numbers of distinct values. Figure 3.12 shows three candidate attributes for partitioning the data set given in Table 3.3. As previously mentioned, the attribute Marital Status is a better choice than the attribute Home Owner, because it provides a larger information gain. However, if we compare them against Customer ID, the latter produces the purest partitions with the maximum information gain, since the weighted entropy and Gini index is equal to zero for its children. Yet, Customer ID is not a good attribute for splitting because it has a unique value for each instance. Even though a test condition involving Customer ID will accurately classify every instance in the training data, we cannot use such a test condition on new test instances with Customer ID values that haven't been seen before during training. This example suggests that having a low impurity value alone is insufficient to find a good attribute test condition for a node. As we will see later in Section 3.4, having a larger number of child nodes can make a decision tree more complex and consequently more susceptible to overfitting. Hence, the number of children produced by the splitting attribute should also be taken into consideration while deciding the best attribute test condition.

There are two ways to overcome this problem. One way is to generate only binary decision trees, thus avoiding the difficulty of handling attributes with varying number of partitions. This strategy is employed by decision tree classifiers such as CART. Another way is to modify the splitting criterion to take into account the number of partitions produced by the attribute. For example, in the C4.5 decision tree algorithm, a measure known as gain ratio is used to compensate for attributes that produce a large number of child nodes. This measure is computed as follows:

Gain ratio = \frac{\Delta_{info}}{\text{Split Info}} = \frac{\text{Entropy(Parent)} - \sum_{i=1}^{k} \frac{N(v_i)}{N}\,\text{Entropy}(v_i)}{-\sum_{i=1}^{k} \frac{N(v_i)}{N} \log_2 \frac{N(v_i)}{N}}    (3.9)

where N(v_i) is the number of instances assigned to node v_i and k is the total number of splits. The split information measures the entropy of splitting a node into its child nodes and evaluates if the split results in a larger number of equally-sized child nodes or not. For example, if every partition has the same number of instances, then ∀i: N(v_i)/N = 1/k and the split information would be equal to \log_2 k. Thus, if an attribute produces a large number of splits, its split information is also large, which in turn, reduces the gain ratio.

Example 3.4. Gain Ratio
Consider the data set given in Exercise 2 on page 185. We want to select the best attribute test condition among the following three attributes: Gender, Car Type, and Customer ID. The entropy before splitting is

Entropy(parent) = -\frac{10}{20}\log_2\frac{10}{20} - \frac{10}{20}\log_2\frac{10}{20} = 1.

If Gender is used as attribute test condition:

Entropy(children) = \frac{10}{20}\left[-\frac{6}{10}\log_2\frac{6}{10} - \frac{4}{10}\log_2\frac{4}{10}\right] \times 2 = 0.971
Gain Ratio = \frac{1 - 0.971}{-\frac{10}{20}\log_2\frac{10}{20} - \frac{10}{20}\log_2\frac{10}{20}} = \frac{0.029}{1} = 0.029

If Car Type is used as attribute test condition:

Entropy(children) = \frac{4}{20}\left[-\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4}\right] + \frac{8}{20} \times 0 + \frac{8}{20}\left[-\frac{1}{8}\log_2\frac{1}{8} - \frac{7}{8}\log_2\frac{7}{8}\right] = 0.380
Gain Ratio = \frac{1 - 0.380}{-\frac{4}{20}\log_2\frac{4}{20} - \frac{8}{20}\log_2\frac{8}{20} - \frac{8}{20}\log_2\frac{8}{20}} = \frac{0.620}{1.52} = 0.41

Finally, if Customer ID is used as attribute test condition:

Entropy(children) = \frac{1}{20}\left[-\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1}\right] \times 20 = 0
Gain Ratio = \frac{1 - 0}{-\frac{1}{20}\log_2\frac{1}{20} \times 20} = \frac{1}{4.32} = 0.23

Thus, even though Customer ID has the highest information gain, its gain ratio is lower than that of Car Type since it produces a larger number of splits.
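The gain ratio of Equation (3.9) can be sketched as follows in Python (illustrative only; the class counts are those implied by Example 3.4):

    # Gain ratio = (information gain) / (split information), Equation (3.9).
    from math import log2

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * log2(c / n) for c in counts if c > 0)

    def gain_ratio(parent_counts, partitions):
        n = sum(parent_counts)
        sizes = [sum(p) for p in partitions]
        delta_info = entropy(parent_counts) - sum(s / n * entropy(p)
                                                  for s, p in zip(sizes, partitions))
        split_info = -sum(s / n * log2(s / n) for s in sizes)
        return delta_info / split_info

    parent = [10, 10]                                   # 10 instances of each class
    gender = [[6, 4], [4, 6]]
    car_type = [[1, 3], [8, 0], [1, 7]]
    customer_id = [[1, 0]] * 10 + [[0, 1]] * 10         # one instance per child node

    for name, split in [('Gender', gender), ('Car Type', car_type),
                        ('Customer ID', customer_id)]:
        print(name, round(gain_ratio(parent, split), 3))
    # Gender ~0.029, Car Type ~0.41, Customer ID ~0.23, as in Example 3.4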

3.3.4 Algorithm for Decision Tree Induction

Algorithm 3.1 presents pseudocode for the decision tree induction algorithm. The input to this algorithm is a set of training instances E along with the attribute set F. The algorithm works by recursively selecting the best attribute to split the data (Step 7) and expanding the nodes of the tree (Steps 11 and 12) until the stopping criterion is met (Step 1). The details of this algorithm are explained below.

Algorithm 3.1. A skeleton decision tree induction algorithm.

1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree either has a test condition, denoted as node.test_cond, or a class label, denoted as node.label.

2. The find_best_split() function determines the attribute test condition for partitioning the training instances associated with a node. The splitting attribute chosen depends on the impurity measure used. The popular measures include entropy and the Gini index.

3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training instances from class i associated with the node t. The label assigned to the leaf node is typically the one that occurs most frequently in the training instances that are associated with this node:

leaf.label = \underset{i}{\operatorname{argmax}}\; p(i|t),    (3.10)

where the argmax operator returns the class i that maximizes p(i|t). Besides providing the information needed to determine the class label of a leaf node, p(i|t) can also be used as a rough estimate of the probability that an instance assigned to the leaf node t belongs to class i. Sections 4.11.2 and 4.11.4 in the next chapter describe how such probability estimates can be used to determine the performance of a decision tree under different cost functions.

4. The stopping_cond() function is used to terminate the tree-growing process by checking whether all the instances have identical class label or attribute values. Since decision tree classifiers employ a top-down, recursive partitioning approach for building a model, the number of training instances associated with a node decreases as the depth of the tree increases. As a result, a leaf node may contain too few training instances to make a statistically significant decision about its class label. This is known as the data fragmentation problem. One way to avoid this problem is to disallow splitting of a node when the number of instances associated with the node falls below a certain threshold. A more systematic way to control the size of a decision tree (number of leaf nodes) will be discussed in Section 3.5.4.
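Because the body of Algorithm 3.1 is not reproduced here, the following Python sketch (an illustration of the skeleton described above, not the book's exact pseudocode) shows how functions in the spirit of stopping_cond(), find_best_split(), and Classify() fit together; it uses entropy and binary threshold splits for brevity:

    # Illustrative skeleton of recursive decision tree induction.
    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def classify(labels):                      # majority class label for a leaf (Eq. 3.10)
        return Counter(labels).most_common(1)[0][0]

    def stopping_cond(X, y):                   # identical labels or identical attribute values
        return len(set(y)) == 1 or len(set(map(tuple, X))) == 1

    def find_best_split(X, y):                 # attribute/threshold pair with the highest gain
        best_gain, best_split = -1.0, None
        for a in range(len(X[0])):
            for v in sorted(set(row[a] for row in X))[:-1]:
                left = [y[i] for i, row in enumerate(X) if row[a] <= v]
                right = [y[i] for i, row in enumerate(X) if row[a] > v]
                gain = entropy(y) - (len(left) * entropy(left) +
                                     len(right) * entropy(right)) / len(y)
                if gain > best_gain:
                    best_gain, best_split = gain, (a, v)
        return best_split

    def tree_growth(X, y):                     # each "node" is created as a nested dict
        if stopping_cond(X, y) or find_best_split(X, y) is None:
            return {'label': classify(y)}
        a, v = find_best_split(X, y)
        L = [i for i, row in enumerate(X) if row[a] <= v]
        R = [i for i, row in enumerate(X) if row[a] > v]
        return {'test': (a, v),
                'left': tree_growth([X[i] for i in L], [y[i] for i in L]),
                'right': tree_growth([X[i] for i in R], [y[i] for i in R])}

    # Tiny hypothetical data set: the sketch recovers a two-leaf tree on the first attribute.
    print(tree_growth([[0, 60], [0, 70], [1, 95], [1, 85]], ['No', 'No', 'Yes', 'Yes']))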

3.3.5ExampleApplication:WebRobotDetection

Considerthetaskofdistinguishingtheaccesspatternsofwebrobotsfromthosegeneratedbyhumanusers.Awebrobot(alsoknownasawebcrawler)isasoftwareprogramthatautomaticallyretrievesfilesfromoneormorewebsitesbyfollowingthehyperlinksextractedfromaninitialsetofseedURLs.Theseprogramshavebeendeployedforvariouspurposes,fromgatheringwebpagesonbehalfofsearchenginestomoremaliciousactivitiessuchasspammingandcommittingclickfraudsinonlineadvertisements.


Figure3.15.Inputdataforwebrobotdetection.

Thewebrobotdetectionproblemcanbecastasabinaryclassificationtask.Theinputdatafortheclassificationtaskisawebserverlog,asampleofwhichisshowninFigure3.15(a) .Eachlineinthelogfilecorrespondstoarequestmadebyaclient(i.e.,ahumanuserorawebrobot)tothewebserver.Thefieldsrecordedintheweblogincludetheclient'sIPaddress,timestampoftherequest,URLoftherequestedfile,sizeofthefile,anduseragent,whichisafieldthatcontainsidentifyinginformationabouttheclient.

Forhumanusers,theuseragentfieldspecifiesthetypeofwebbrowserormobiledeviceusedtofetchthefiles,whereasforwebrobots,itshouldtechnicallycontainthenameofthecrawlerprogram.However,webrobotsmayconcealtheirtrueidentitiesbydeclaringtheiruseragentfieldstobeidenticaltoknownbrowsers.Therefore,useragentisnotareliablefieldtodetectwebrobots.

Thefirststeptowardbuildingaclassificationmodelistopreciselydefineadatainstanceandassociatedattributes.Asimpleapproachistoconsidereachlogentryasadatainstanceandusetheappropriatefieldsinthelogfileasitsattributeset.Thisapproach,however,isinadequateforseveralreasons.First,manyoftheattributesarenominal-valuedandhaveawiderangeofdomainvalues.Forexample,thenumberofuniqueclientIPaddresses,URLs,andreferrersinalogfilecanbeverylarge.Theseattributesareundesirableforbuildingadecisiontreebecausetheirsplitinformationisextremelyhigh(seeEquation(3.9) ).Inaddition,itmightnotbepossibletoclassifytestinstancescontainingIPaddresses,URLs,orreferrersthatarenotpresentinthetrainingdata.Finally,byconsideringeachlogentryasaseparatedatainstance,wedisregardthesequenceofwebpagesretrievedbytheclient—acriticalpieceofinformationthatcanhelpdistinguishwebrobotaccessesfromthoseofahumanuser.

Abetteralternativeistoconsidereachwebsessionasadatainstance.Awebsessionisasequenceofrequestsmadebyaclientduringagivenvisittothewebsite.Eachwebsessioncanbemodeledasadirectedgraph,inwhichthenodescorrespondtowebpagesandtheedgescorrespondtohyperlinksconnectingonewebpagetoanother.Figure3.15(b) showsagraphicalrepresentationofthefirstwebsessiongiveninthelogfile.Everywebsessioncanbecharacterizedusingsomemeaningfulattributesaboutthegraphthatcontaindiscriminatoryinformation.Figure3.15(c) showssomeoftheattributesextractedfromthegraph,includingthedepthandbreadthofits

correspondingtreerootedattheentrypointtothewebsite.Forexample,thedepthandbreadthofthetreeshowninFigure3.15(b) arebothequaltotwo.

ThederivedattributesshowninFigure3.15(c) aremoreinformativethantheoriginalattributesgiveninthelogfilebecausetheycharacterizethebehavioroftheclientatthewebsite.Usingthisapproach,adatasetcontaining2916instanceswascreated,withequalnumbersofsessionsduetowebrobots(class1)andhumanusers(class0).10%ofthedatawerereservedfortrainingwhiletheremaining90%wereusedfortesting.TheinduceddecisiontreeisshowninFigure3.16 ,whichhasanerrorrateequalto3.8%onthetrainingsetand5.3%onthetestset.Inadditiontoitslowerrorrate,thetreealsorevealssomeinterestingpropertiesthatcanhelpdiscriminatewebrobotsfromhumanusers:

1. Accessesbywebrobotstendtobebroadbutshallow,whereasaccessesbyhumanuserstendtobemorefocused(narrowbutdeep).

2. Webrobotsseldomretrievetheimagepagesassociatedwithawebpage.

3. Sessionsduetowebrobotstendtobelongandcontainalargenumberofrequestedpages.

4. Webrobotsaremorelikelytomakerepeatedrequestsforthesamewebpagethanhumanuserssincethewebpagesretrievedbyhumanusersareoftencachedbythebrowser.

3.3.6CharacteristicsofDecisionTreeClassifiers

Thefollowingisasummaryoftheimportantcharacteristicsofdecisiontreeinductionalgorithms.

1. Applicability:Decisiontreesareanonparametricapproachforbuildingclassificationmodels.Thisapproachdoesnotrequireanypriorassumptionabouttheprobabilitydistributiongoverningtheclassandattributesofthedata,andthus,isapplicabletoawidevarietyofdatasets.Itisalsoapplicabletobothcategoricalandcontinuousdatawithoutrequiringtheattributestobetransformedintoacommonrepresentationviabinarization,normalization,orstandardization.UnlikesomebinaryclassifiersdescribedinChapter4 ,itcanalsodealwithmulticlassproblemswithouttheneedtodecomposethemintomultiplebinaryclassificationtasks.Anotherappealingfeatureofdecisiontreeclassifiersisthattheinducedtrees,especiallytheshorterones,arerelativelyeasytointerpret.Theaccuraciesofthetreesarealsoquitecomparabletootherclassificationtechniquesformanysimpledatasets.

2. Expressiveness: A decision tree provides a universal representation for discrete-valued functions. In other words, it can encode any function of discrete-valued attributes. This is because every discrete-valued function can be represented as an assignment table, where every unique combination of discrete attributes is assigned a class label. Since every combination of attributes can be represented as a leaf in the decision tree, we can always find a decision tree whose label assignments at the leaf nodes match the assignment table of the original function. Decision trees can also help in providing compact representations of functions when some of the unique combinations of attributes can be represented by the same leaf node. For example, Figure 3.17 shows the assignment table of the Boolean function (A ∧ B) ∨ (C ∧ D) involving four binary attributes, resulting in a total of 2^4 = 16 possible assignments. The tree shown in Figure 3.17 shows a compressed encoding of this assignment table. Instead of requiring a fully-grown tree with 16 leaf nodes, it is possible to encode the function using a simpler tree with only 7 leaf nodes. Nevertheless, not all decision trees for discrete-valued attributes can be simplified. One notable example is the parity function, whose value is 1 when there is an even number of true values among its Boolean attributes, and 0 otherwise. Accurate modeling of such a function requires a full decision tree with 2^d nodes, where d is the number of Boolean attributes (see Exercise 1 on page 185).
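As a quick sanity check (an illustrative sketch, not from the text), the following Python snippet encodes a seven-leaf tree for (A ∧ B) ∨ (C ∧ D) as nested tests and verifies that it reproduces all 2^4 = 16 rows of the assignment table:

    # A compact decision tree for (A and B) or (C and D).
    from itertools import product

    def cd_subtree(c, d):
        # 3 leaves: C=0 -> 0, C=1 & D=0 -> 0, C=1 & D=1 -> 1
        if c:
            return 1 if d else 0
        return 0

    def tree(a, b, c, d):
        # The root tests A; A=1 & B=1 is a single leaf, and the two remaining
        # branches each reuse the 3-leaf C/D subtree, giving 1 + 3 + 3 = 7 leaves.
        if a:
            return 1 if b else cd_subtree(c, d)
        return cd_subtree(c, d)

    def truth(a, b, c, d):
        return int((a and b) or (c and d))

    assert all(tree(*bits) == truth(*bits) for bits in product([0, 1], repeat=4))
    print("tree matches the assignment table on all 16 combinations")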

Figure3.16.Decisiontreemodelforwebrobotdetection.

Figure 3.17. Decision tree for the Boolean function (A ∧ B) ∨ (C ∧ D).

3. Computational Efficiency: Since the number of possible decision trees can be very large, many decision tree algorithms employ a heuristic-based approach to guide their search in the vast hypothesis space. For example, the algorithm presented in Section 3.3.4 uses a greedy, top-down, recursive partitioning strategy for growing a decision tree. For many data sets, such techniques quickly construct a reasonably good decision tree even when the training set size is very large. Furthermore, once a decision tree has been built, classifying a test record is extremely fast, with a worst-case complexity of O(w), where w is the maximum depth of the tree.

4. Handling Missing Values: A decision tree classifier can handle missing attribute values in a number of ways, both in the training and the test sets. When there are missing values in the test set, the classifier must decide which branch to follow if the value of a splitting node attribute is missing for a given test instance. One approach, known as the probabilistic split method, which is employed by the C4.5 decision tree classifier, distributes the data instance to every child of the splitting node according to the probability that the missing attribute has a particular value. In contrast, the CART algorithm uses the surrogate split method, where the instance whose splitting attribute value is missing is assigned to one of the child nodes based on the value of another non-missing surrogate attribute whose splits most resemble the partitions made by the missing attribute. Another approach, known as the separate class method, is used by the CHAID algorithm, where the missing value is treated as a separate categorical value distinct from other values of the splitting attribute. Figure 3.18 shows an example of the three different ways for handling missing values in a decision tree classifier. Other strategies for dealing with missing values are based on data preprocessing, where the instance with a missing value is either imputed with the mode (for a categorical attribute) or mean (for a continuous attribute) value or discarded before the classifier is trained.

Figure3.18.Methodsforhandlingmissingattributevaluesindecisiontreeclassifier.

Duringtraining,ifanattributevhasmissingvaluesinsomeofthetraininginstancesassociatedwithanode,weneedawaytomeasurethegaininpurityifvisusedforsplitting.Onesimplewayistoexcludeinstanceswithmissingvaluesofvinthecountingofinstancesassociatedwitheverychildnode,generatedforeverypossibleoutcomeofv.Further,ifvischosenastheattributetestconditionatanode,traininginstanceswithmissingvaluesofvcanbepropagatedtothechildnodesusinganyofthemethodsdescribedaboveforhandlingmissingvaluesintestinstances.

5. Handling Interactions among Attributes: Attributes are considered interacting if they are able to distinguish between classes when used together, but individually they provide little or no information. Due to the greedy nature of the splitting criteria in decision trees, such attributes could be passed over in favor of other attributes that are not as useful. This could result in more complex decision trees than necessary. Hence, decision trees can perform poorly when there are interactions among attributes. To illustrate this point, consider the three-dimensional data shown in Figure 3.19(a), which contains 2000 data points from one of two classes, denoted as + and ∘ in the diagram. Figure 3.19(b) shows the distribution of the two classes in the two-dimensional space involving attributes X and Y, which is a noisy version of the XOR Boolean function. We can see that even though the two classes are well-separated in this two-dimensional space, neither of the two attributes contain sufficient information to distinguish between the two classes when used alone. For example, the entropies of the following attribute test conditions, X ≤ 10 and Y ≤ 10, are close to 1, indicating that neither X nor Y provide any reduction in the impurity measure when used individually. X and Y thus represent a case of interaction among attributes. The data set also contains a third attribute, Z, in which both classes are distributed uniformly, as shown in Figures 3.19(c) and 3.19(d), and hence, the entropy of any split involving Z is close to 1. As a result, Z is as likely to be chosen for splitting as the interacting but useful attributes, X and Y. For further illustration of this issue, readers are referred to Example 3.7 in Section 3.4.1 and Exercise 7 at the end of this chapter.

Figure3.19.ExampleofaXORdatainvolvingXandY,alongwithanirrelevantattributeZ.

6. HandlingIrrelevantAttributes:Anattributeisirrelevantifitisnotusefulfortheclassificationtask.Sinceirrelevantattributesarepoorlyassociatedwiththetargetclasslabels,theywillprovidelittleornogaininpurityandthuswillbepassedoverbyothermorerelevantfeatures.Hence,thepresenceofasmallnumberofirrelevantattributeswillnotimpactthedecisiontreeconstructionprocess.However,notallattributesthatprovidelittletonogainareirrelevant(seeFigure3.19 ).Hence,iftheclassificationproblemiscomplex(e.g.,involvinginteractionsamongattributes)andtherearealargenumberofirrelevantattributes,thensomeoftheseattributesmaybeaccidentallychosenduringthetree-growingprocess,sincetheymayprovideabettergainthanarelevantattributejustbyrandomchance.Featureselectiontechniquescanhelptoimprovetheaccuracyofdecisiontreesbyeliminatingtheirrelevantattributesduringpreprocessing.WewillinvestigatetheissueoftoomanyirrelevantattributesinSection3.4.1 .

7. HandlingRedundantAttributes:Anattributeisredundantifitisstronglycorrelatedwithanotherattributeinthedata.Sinceredundantattributesshowsimilargainsinpurityiftheyareselectedforsplitting,onlyoneofthemwillbeselectedasanattributetestconditioninthedecisiontreealgorithm.Decisiontreescanthushandlethepresenceofredundantattributes.

8. Using Rectilinear Splits: The test conditions described so far in this chapter involve using only a single attribute at a time. As a consequence, the tree-growing procedure can be viewed as the process of partitioning the attribute space into disjoint regions until each region contains records of the same class. The border between two neighboring regions of different classes is known as a decision boundary. Figure 3.20 shows the decision tree as well as the decision boundary for a binary classification problem. Since the test condition involves only a single attribute, the decision boundaries are rectilinear, i.e., parallel to the coordinate axes. This limits the expressiveness of decision trees in representing decision boundaries of data sets with continuous attributes. Figure 3.21 shows a two-dimensional data set involving binary classes that cannot be perfectly classified by a decision tree whose attribute test conditions are defined based on single attributes. The binary classes in the data set are generated from two skewed Gaussian distributions, centered at (8,8) and (12,12), respectively. The true decision boundary is represented by the diagonal dashed line, whereas the rectilinear decision boundary produced by the decision tree classifier is shown by the thick solid line. In contrast, an oblique decision tree may overcome this limitation by allowing the test condition to be specified using more than one attribute. For example, the binary classification data shown in Figure 3.21 can be easily represented by an oblique decision tree with a single root node with test condition x + y < 20.

Figure 3.20. Example of a decision tree and its decision boundaries for a two-dimensional data set.

Figure3.21.Exampleofdatasetthatcannotbepartitionedoptimallyusingadecisiontreewithsingleattributetestconditions.Thetruedecisionboundaryisshownbythedashedline.

Althoughanobliquedecisiontreeismoreexpressiveandcanproducemorecompacttrees,findingtheoptimaltestconditioniscomputationallymoreexpensive.

9. ChoiceofImpurityMeasure:Itshouldbenotedthatthechoiceofimpuritymeasureoftenhaslittleeffectontheperformanceofdecisiontreeclassifierssincemanyoftheimpuritymeasuresarequiteconsistentwitheachother,asshowninFigure3.11 onpage129.Instead,thestrategyusedtoprunethetreehasagreaterimpactonthefinaltreethanthechoiceofimpuritymeasure.

3.4 Model Overfitting
Methods presented so far try to learn classification models that show the lowest error on the training set. However, as we will show in the following example, even if a model fits well over the training data, it can still show poor generalization performance, a phenomenon known as model overfitting.

Figure3.22.Examplesoftrainingandtestsetsofatwo-dimensionalclassificationproblem.

Figure3.23.Effectofvaryingtreesize(numberofleafnodes)ontrainingandtesterrors.

Example 3.5. Overfitting and Underfitting of Decision Trees
Consider the two-dimensional data set shown in Figure 3.22(a). The data set contains instances that belong to two separate classes, represented as + and ∘, respectively, where each class has 5400 instances. All instances belonging to the ∘ class were generated from a uniform distribution. For the + class, 5000 instances were generated from a Gaussian distribution centered at (10,10) with unit variance, while the remaining 400 instances were sampled from the same uniform distribution as the ∘ class. We can see from Figure 3.22(a) that the + class can be largely distinguished from the ∘ class by drawing a circle of appropriate size centered at (10,10). To learn a classifier using this two-dimensional data set, we randomly sampled 10% of the data for training and used the remaining 90% for testing. The training set, shown in Figure 3.22(b), looks quite representative of the overall data. We used the Gini index as the impurity measure to construct decision trees of increasing sizes (number of leaf nodes), by recursively expanding a node into child nodes till every leaf node was pure, as described in Section 3.3.4.

Figure 3.23(a) shows changes in the training and test error rates as the size of the tree varies from 1 to 8. Both error rates are initially large when the tree has only one or two leaf nodes. This situation is known as model underfitting. Underfitting occurs when the learned decision tree is too simplistic, and thus, incapable of fully representing the true relationship between the attributes and the class labels. As we increase the tree size from 1 to 8, we can observe two effects. First, both the error rates decrease since larger trees are able to represent more complex decision boundaries. Second, the training and test error rates are quite close to each other, which indicates that the performance on the training set is fairly representative of the generalization performance. As we further increase the size of the tree from 8 to 150, the training error continues to steadily decrease till it eventually reaches zero, as shown in Figure 3.23(b). However, in a striking contrast, the test error rate ceases to decrease any further beyond a certain tree size, and then it begins to increase. The training error rate thus grossly underestimates the test error rate once the tree becomes too large. Further, the gap between the training and test error rates keeps on widening as we increase the tree size. This behavior, which may seem counter-intuitive at first, can be attributed to the phenomenon of model overfitting.
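A rough re-creation of this experiment in Python (assuming NumPy and scikit-learn; the extent of the uniform distribution, the random seed, and the particular tree sizes are our own choices) is sketched below:

    # Two classes in 2-D, trees of increasing size, training vs. test error.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    # '+' class: 5000 Gaussian points around (10, 10) plus 400 uniform noise points;
    # 'o' class: 5400 uniform points over the same square (extent assumed to be [0, 20]).
    plus = np.vstack([rng.normal(10, 1, size=(5000, 2)),
                      rng.uniform(0, 20, size=(400, 2))])
    circ = rng.uniform(0, 20, size=(5400, 2))
    X = np.vstack([plus, circ])
    y = np.array([1] * 5400 + [0] * 5400)

    # 10% of the data for training, 90% for testing, as in the example.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.1, random_state=0)

    for leaves in [2, 8, 50, 150]:
        tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0)
        tree.fit(X_tr, y_tr)
        print(leaves,
              round(1 - tree.score(X_tr, y_tr), 3),   # training error rate
              round(1 - tree.score(X_te, y_te), 3))   # test error rate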

3.4.1 Reasons for Model Overfitting

Model overfitting is the phenomenon where, in the pursuit of minimizing the training error rate, an overly complex model is selected that captures specific patterns in the training data but fails to learn the true nature of relationships between attributes and class labels in the overall data. To illustrate this, Figure 3.24 shows decision trees and their corresponding decision boundaries (shaded rectangles represent regions assigned to the + class) for two trees of sizes 5 and 50. We can see that the decision tree of size 5 appears quite simple and its decision boundaries provide a reasonable approximation to the ideal decision boundary, which in this case corresponds to a circle centered around the Gaussian distribution at (10,10). Although its training and test error rates are non-zero, they are very close to each other, which indicates that the patterns learned in the training set should generalize well over the test set. On the other hand, the decision tree of size 50 appears much more complex than the tree of size 5, with complicated decision boundaries. For example, some of its shaded rectangles (assigned the + class) attempt to cover narrow regions in the input space that contain only one or two + training instances. Note that the prevalence of + instances in such regions is highly specific to the training set, as these regions are mostly dominated by ∘ instances in the overall data. Hence, in an attempt to perfectly fit the training data, the decision tree of size 50 starts fine-tuning itself to specific patterns in the training data, leading to poor performance on an independently chosen test set.

Figure 3.24. Decision trees with different model complexities.

Figure 3.25. Performance of decision trees using 20% of the data for training (twice the original training size).

There are a number of factors that influence model overfitting. In the following, we provide brief descriptions of two of the major factors: limited training size and high model complexity. Though they are not exhaustive, the interplay between them can help explain most of the common model overfitting phenomena in real-world applications.

Limited Training Size
Note that a training set consisting of a finite number of instances can only provide a limited representation of the overall data. Hence, it is possible that the patterns learned from a training set do not fully represent the true patterns in the overall data, leading to model overfitting. In general, as we increase the size of a training set (number of training instances), the patterns learned from the training set start resembling the true patterns in the overall data. Hence, the effect of overfitting can be reduced by increasing the training size, as illustrated in the following example.

Example 3.6. Effect of Training Size
Suppose that we use twice the number of training instances than what we had used in the experiments conducted in Example 3.5. Specifically, we use 20% of the data for training and use the remainder for testing. Figure 3.25(b) shows the training and test error rates as the size of the tree is varied from 1 to 150. There are two major differences in the trends shown in this figure and those shown in Figure 3.23(b) (using only 10% of the data for training). First, even though the training error rate decreases with increasing tree size in both figures, its rate of decrease is much smaller when we use twice the training size. Second, for a given tree size, the gap between the training and test error rates is much smaller when we use twice the training size. These differences suggest that the patterns learned using 20% of data for training are more generalizable than those learned using 10% of data for training.

Figure 3.25(a) shows the decision boundaries for the tree of size 50, learned using 20% of data for training. In contrast to the tree of the same size learned using 10% data for training (see Figure 3.24(d)), we can see that the decision tree is not capturing specific patterns of noisy instances in the training set. Instead, the high model complexity of 50 leaf nodes is being effectively used to learn the boundaries of the + instances centered at (10,10).

High Model Complexity
Generally, a more complex model has a better ability to represent complex patterns in the data. For example, decision trees with a larger number of leaf nodes can represent more complex decision boundaries than decision trees with fewer leaf nodes. However, an overly complex model also has a tendency to learn specific patterns in the training set that do not generalize well over unseen instances. Models with high complexity should thus be judiciously used to avoid overfitting.

One measure of model complexity is the number of "parameters" that need to be inferred from the training set. For example, in the case of decision tree induction, the attribute test conditions at internal nodes correspond to the parameters of the model that need to be inferred from the training set. A decision tree with a larger number of attribute test conditions (and consequently more leaf nodes) thus involves more "parameters" and hence is more complex.

Given a class of models with a certain number of parameters, a learning algorithm attempts to select the best combination of parameter values that maximizes an evaluation metric (e.g., accuracy) over the training set. If the number of parameter value combinations (and hence the complexity) is large, the learning algorithm has to select the best combination from a large number of possibilities, using a limited training set. In such cases, there is a high chance for the learning algorithm to pick a spurious combination of parameters that maximizes the evaluation metric just by random chance. This is similar to the multiple comparisons problem (also referred to as the multiple testing problem) in statistics.

As an illustration of the multiple comparisons problem, consider the task of predicting whether the stock market will rise or fall in the next ten trading days. If a stock analyst simply makes random guesses, the probability that her prediction is correct on any trading day is 0.5. However, the probability that she will predict correctly at least nine out of ten times is

\frac{\binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0107,

which is extremely low.

Suppose we are interested in choosing an investment advisor from a pool of 200 stock analysts. Our strategy is to select the analyst who makes the most number of correct predictions in the next ten trading days. The flaw in this strategy is that even if all the analysts make their predictions in a random fashion, the probability that at least one of them makes at least nine correct predictions is

1 - (1 - 0.0107)^{200} = 0.8847,

which is very high. Although each analyst has a low probability of predicting at least nine times correctly, considered together, we have a high probability of finding at least one analyst who can do so. However, there is no guarantee in the future that such an analyst will continue to make accurate predictions by random guessing.

How does the multiple comparisons problem relate to model overfitting? In the context of learning a classification model, each combination of parameter values corresponds to an analyst, while the number of training instances corresponds to the number of days. Analogous to the task of selecting the best analyst who makes the most accurate predictions on consecutive days, the task of a learning algorithm is to select the best combination of parameters that results in the highest accuracy on the training set. If the number of parameter combinations is large but the training size is small, it is highly likely for the learning algorithm to choose a spurious parameter combination that provides high training accuracy just by random chance. In the following example, we illustrate the phenomenon of overfitting due to multiple comparisons in the context of decision tree induction.
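Both probabilities are easy to verify with a short Python snippet (illustrative only):

    # Multiple comparisons illustration: one analyst vs. the best of 200 analysts.
    from math import comb

    # P(a single analyst guessing at random is correct on at least 9 of 10 days)
    p_single = (comb(10, 9) + comb(10, 10)) / 2 ** 10
    # P(at least one of 200 independent random analysts achieves this)
    p_any = 1 - (1 - p_single) ** 200
    print(round(p_single, 4), round(p_any, 4))   # 0.0107 0.8847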

Figure 3.26. Example of a two-dimensional (X-Y) data set.

Figure 3.27. Training and test error rates illustrating the effect of the multiple comparisons problem on model overfitting.

Example 3.7. Multiple Comparisons and Overfitting
Consider the two-dimensional data set shown in Figure 3.26 containing 500 + and 500 ∘ instances, which is similar to the data shown in Figure 3.19. In this data set, the distributions of both classes are well-separated in the two-dimensional (X-Y) attribute space, but none of the two attributes (X or Y) are sufficiently informative to be used alone for separating the two classes. Hence, splitting the data set based on any value of an X or Y attribute will provide close to zero reduction in an impurity measure. However, if X and Y attributes are used together in the splitting criterion (e.g., splitting X at 10 and Y at 10), the two classes can be effectively separated.

Figure 3.28. Decision tree with 6 leaf nodes using X and Y as attributes. Splits have been numbered from 1 to 5 in order of their occurrence in the tree.

Figure 3.27(a) shows the training and test error rates for learning decision trees of varying sizes, when 30% of the data is used for training and the remainder of the data for testing. We can see that the two classes can be separated using a small number of leaf nodes. Figure 3.28 shows the decision boundaries for the tree with six leaf nodes, where the splits have been numbered according to their order of appearance in the tree. Note that even though splits 1 and 3 provide trivial gains, their consequent splits (2, 4, and 5) provide large gains, resulting in effective discrimination of the two classes.

Assume we add 100 irrelevant attributes to the two-dimensional X-Y data. Learning a decision tree from this resultant data will be challenging because the number of candidate attributes to choose for splitting at every internal node will increase from two to 102. With such a large number of candidate attribute test conditions to choose from, it is quite likely that spurious attribute test conditions will be selected at internal nodes because of the multiple comparisons problem. Figure 3.27(b) shows the training and test error rates after adding 100 irrelevant attributes to the training set. We can see that the test error rate remains close to 0.5 even after using 50 leaf nodes, while the training error rate keeps on declining and eventually becomes 0.

3.5 Model Selection

There are many possible classification models with varying levels of model complexity that can be used to capture patterns in the training data. Among these possibilities, we want to select the model that shows the lowest generalization error rate. The process of selecting a model with the right level of complexity, which is expected to generalize well over unseen test instances, is known as model selection. As described in the previous section, the training error rate cannot be reliably used as the sole criterion for model selection. In the following, we present three generic approaches to estimate the generalization performance of a model that can be used for model selection. We conclude this section by presenting specific strategies for using these approaches in the context of decision tree induction.

3.5.1 Using a Validation Set

Note that we can always estimate the generalization error rate of a model by using “out-of-sample” estimates, i.e., by evaluating the model on a separate validation set that is not used for training the model. The error rate on the validation set, termed the validation error rate, is a better indicator of generalization performance than the training error rate, since the validation set has not been used for training the model. The validation error rate can be used for model selection as follows.

Given a training set D.train, we can partition D.train into two smaller subsets, D.tr and D.val, such that D.tr is used for training while D.val is used as the validation set. For example, two-thirds of D.train can be reserved as D.tr for training, while the remaining one-third is used as D.val for computing the validation error rate. For any choice of classification model m that is trained on D.tr, we can estimate its validation error rate on D.val, denoted as $err_{val}(m)$. The model that shows the lowest value of $err_{val}(m)$ can then be selected as the preferred choice of model.

The use of a validation set provides a generic approach for model selection. However, one limitation of this approach is that it is sensitive to the sizes of D.tr and D.val obtained by partitioning D.train. If the size of D.tr is too small, it may result in the learning of a poor classification model with substandard performance, since a smaller training set will be less representative of the overall data. On the other hand, if the size of D.val is too small, the validation error rate might not be reliable for selecting models, as it would be computed over a small number of instances.
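As a concrete, purely illustrative sketch of this procedure (not from the text), the following Python code uses scikit-learn decision trees, treats a small set of candidate tree sizes as the competing models, and reserves one-third of D.train as D.val; the synthetic data set and the candidate sizes are arbitrary assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for D.train, split into D.tr (2/3) and D.val (1/3).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1/3, random_state=0)

best_model, best_err = None, float("inf")
for n_leaves in (2, 4, 8, 16, 32, 64):           # candidate models m of growing complexity
    m = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0).fit(X_tr, y_tr)
    err_val = 1.0 - m.score(X_val, y_val)        # validation error rate err_val(m)
    if err_val < best_err:
        best_model, best_err = m, err_val
print(best_model.get_n_leaves(), best_err)       # preferred model and its validation error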

Figure 3.29. Class distribution of validation data for the two decision trees shown in Figure 3.30.

Example 3.8. Validation Error. In the following example, we illustrate one possible approach for using a validation set in decision tree induction. Figure 3.29 shows the predicted labels at the leaf nodes of the decision trees generated in Figure 3.30. The counts given beneath the leaf nodes represent the proportion of data instances in the validation set that reach each of the nodes. Based on the predicted labels of the nodes, the validation error rate for the left tree is $err_{val}(T_L) = 6/16 = 0.375$, while the validation error rate for the right tree is $err_{val}(T_R) = 4/16 = 0.25$. Based on their validation error rates, the right tree is preferred over the left one.

3.5.2 Incorporating Model Complexity

Since the chance for model overfitting increases as the model becomes more complex, a model selection approach should not only consider the training error rate but also the model complexity. This strategy is inspired by a well-known principle known as Occam's razor or the principle of parsimony, which suggests that given two models with the same errors, the simpler model is preferred over the more complex model. A generic approach to account for model complexity while estimating generalization performance is formally described as follows.

Given a training set D.train, let us consider learning a classification model m that belongs to a certain class of models, $\mathcal{M}$. For example, if $\mathcal{M}$ represents the set of all possible decision trees, then m can correspond to a specific decision tree learned from the training set. We are interested in estimating the generalization error rate of m, gen.error(m). As discussed previously, the training error rate of m, train.error(m, D.train), can underestimate gen.error(m) when the model complexity is high. Hence, we represent gen.error(m) as a function of not just the training error rate but also the model complexity of $\mathcal{M}$, complexity($\mathcal{M}$), as follows:

$$gen.error(m) = train.error(m, D.train) + \alpha \times complexity(\mathcal{M}), \qquad (3.11)$$

where $\alpha$ is a hyper-parameter that strikes a balance between minimizing training error and reducing model complexity. A higher value of $\alpha$ gives more emphasis to the model complexity in the estimation of generalization performance. To choose the right value of $\alpha$, we can make use of the validation set in a similar way as described in Section 3.5.1. For example, we can iterate through a range of values of $\alpha$, and for every possible value, we can learn a model on a subset of the training set, D.tr, and compute its validation error rate on a separate subset, D.val. We can then select the value of $\alpha$ that provides the lowest validation error rate.

Equation 3.11 provides one possible approach for incorporating model complexity into the estimate of generalization performance. This approach is at the heart of a number of techniques for estimating generalization performance, such as the structural risk minimization principle, Akaike's Information Criterion (AIC), and the Bayesian Information Criterion (BIC). The structural risk minimization principle serves as the building block for learning support vector machines, which will be discussed later in Chapter 4. For more details on AIC and BIC, see the Bibliographic Notes.

In the following, we present two different approaches for estimating the complexity of a model, complexity($\mathcal{M}$). While the former is specific to decision trees, the latter is more generic and can be used with any class of models.

Estimating the Complexity of Decision Trees

In the context of decision trees, the complexity of a decision tree can be estimated as the ratio of the number of leaf nodes to the number of training instances. Let k be the number of leaf nodes and $N_{train}$ be the number of training instances. The complexity of a decision tree can then be described as $k/N_{train}$. This reflects the intuition that for a larger training size, we can learn a decision tree with a larger number of leaf nodes without it becoming overly complex. The generalization error rate of a decision tree T can then be computed using Equation 3.11 as follows:

$$err_{gen}(T) = err(T) + \Omega \times \frac{k}{N_{train}},$$

where err(T) is the training error of the decision tree and $\Omega$ is a hyper-parameter that makes a trade-off between reducing the training error and minimizing the model complexity, similar to the use of $\alpha$ in Equation 3.11. $\Omega$ can be viewed as the relative cost of adding a leaf node relative to incurring a training error. In the literature on decision tree induction, the above approach for estimating the generalization error rate is also termed the pessimistic error estimate. It is called pessimistic because it assumes the generalization error rate to be worse than the training error rate (by adding a penalty term for model complexity). On the other hand, simply using the training error rate as an estimate of the generalization error rate is called the optimistic error estimate or the resubstitution estimate.

Example 3.9. Generalization Error Estimates. Consider the two binary decision trees, $T_L$ and $T_R$, shown in Figure 3.30. Both trees are generated from the same training data, and $T_L$ is generated by expanding three leaf nodes of $T_R$. The counts shown in the leaf nodes of the trees represent the class distribution of the training instances. If each leaf node is labeled according to the majority class of training instances that reach the node, the training error rate for the left tree will be $err(T_L) = 4/24 = 0.167$, while the training error rate for the right tree will be $err(T_R) = 6/24 = 0.25$. Based on their training error rates alone, $T_L$ would be preferred over $T_R$, even though $T_L$ is more complex (contains a larger number of leaf nodes) than $T_R$.

Figure 3.30. Example of two decision trees generated from the same training data.

Now, assume that the cost associated with each leaf node is $\Omega = 0.5$. Then, the generalization error estimate for $T_L$ will be

$$err_{gen}(T_L) = \frac{4}{24} + 0.5 \times \frac{7}{24} = \frac{7.5}{24} = 0.3125$$

and the generalization error estimate for $T_R$ will be

$$err_{gen}(T_R) = \frac{6}{24} + 0.5 \times \frac{4}{24} = \frac{8}{24} = 0.3333.$$

Since $T_L$ has a lower generalization error rate, it will still be preferred over $T_R$. Note that $\Omega = 0.5$ implies that a node should always be expanded into its two child nodes if it improves the prediction of at least one training instance, since expanding a node is less costly than misclassifying a training instance. On the other hand, if $\Omega = 1$, then the generalization error rate for $T_L$ is $err_{gen}(T_L) = 11/24 = 0.458$ and for $T_R$ is $err_{gen}(T_R) = 10/24 = 0.417$. In this case, $T_R$ will be preferred over $T_L$ because it has a lower generalization error rate. This example illustrates that different choices of $\Omega$ can change our preference of decision trees based on their generalization error estimates. However, for a given choice of $\Omega$, the pessimistic error estimate provides an approach for modeling the generalization performance on unseen test instances. The value of $\Omega$ can be selected with the help of a validation set.
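The following small Python sketch (added here for convenience, not from the text) reproduces the pessimistic error estimates of Example 3.9 for both choices of $\Omega$.

def pessimistic_error(n_errors, n_leaves, n_train, omega):
    # err_gen(T) = err(T) + Omega * k / N_train
    return n_errors / n_train + omega * n_leaves / n_train

for omega in (0.5, 1.0):
    err_TL = pessimistic_error(n_errors=4, n_leaves=7, n_train=24, omega=omega)
    err_TR = pessimistic_error(n_errors=6, n_leaves=4, n_train=24, omega=omega)
    print(omega, round(err_TL, 4), round(err_TR, 4))
# Omega = 0.5: T_L preferred (0.3125 < 0.3333)
# Omega = 1.0: T_R preferred (0.4167 < 0.4583)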

Minimum Description Length Principle

Another way to incorporate model complexity is based on an information-theoretic approach known as the minimum description length or MDL principle. To illustrate this approach, consider the example shown in Figure 3.31. In this example, both person A and person B are given a set of instances with known attribute values x. Assume person A knows the class label y for every instance, while person B has no such information. A would like to share the class information with B by sending a message containing the labels. The message would contain $\Theta(N)$ bits of information, where N is the number of instances.

Figure 3.31. An illustration of the minimum description length principle.

Alternatively, instead of sending the class labels explicitly, A can build a classification model from the instances and transmit it to B. B can then apply the model to determine the class labels of the instances. If the model is 100% accurate, then the cost of transmission is equal to the number of bits required to encode the model. Otherwise, A must also transmit information about which instances are misclassified by the model so that B can reproduce the same class labels. Thus, the overall transmission cost, which is equal to the total description length of the message, is

$$Cost(model, data) = Cost(data \mid model) + \alpha \times Cost(model), \qquad (3.12)$$

where the first term on the right-hand side is the number of bits needed to encode the misclassified instances, while the second term is the number of bits required to encode the model. There is also a hyper-parameter $\alpha$ that trades off the relative costs of the misclassified instances and the model.

Notice the similarity between this equation and the generic equation for the generalization error rate presented in Equation 3.11. A good model must have a total description length less than the number of bits required to encode the entire sequence of class labels. Furthermore, given two competing models, the model with the lower total description length is preferred. An example showing how to compute the total description length of a decision tree is given in Exercise 10 on page 189.

3.5.3 Estimating Statistical Bounds

Instead of using Equation 3.11 to estimate the generalization error rate of a model, an alternative way is to apply a statistical correction to the training error rate of the model that is indicative of its model complexity. This can be done if the probability distribution of the training error is available or can be assumed. For example, the number of errors committed by a leaf node in a decision tree can be assumed to follow a binomial distribution. We can thus compute an upper bound to the observed training error rate that can be used for model selection, as illustrated in the following example.

Example 3.10. Statistical Bounds on Training Error. Consider the left-most branch of the binary decision trees shown in Figure 3.30. Observe that the left-most leaf node of $T_R$ has been expanded into two child nodes in $T_L$. Before splitting, the training error rate of the node is $2/7 = 0.286$. By approximating a binomial distribution with a normal distribution, the following upper bound of the training error rate e can be derived:

$$e_{upper}(N, e, \alpha) = \frac{e + \frac{z_{\alpha/2}^2}{2N} + z_{\alpha/2}\sqrt{\frac{e(1-e)}{N} + \frac{z_{\alpha/2}^2}{4N^2}}}{1 + \frac{z_{\alpha/2}^2}{N}}, \qquad (3.13)$$

where $\alpha$ is the confidence level, $z_{\alpha/2}$ is the standardized value from a standard normal distribution, and N is the total number of training instances used to compute e. By replacing $\alpha = 25\%$, N = 7, and e = 2/7, the upper bound for the error rate is $e_{upper}(7, 2/7, 0.25) = 0.503$, which corresponds to $7 \times 0.503 = 3.521$ errors. If we expand the node into its child nodes as shown in $T_L$, the training error rates for the child nodes are $1/4 = 0.250$ and $1/3 = 0.333$, respectively. Using Equation (3.13), the upper bounds of these error rates are $e_{upper}(4, 1/4, 0.25) = 0.537$ and $e_{upper}(3, 1/3, 0.25) = 0.650$, respectively. The overall training error of the child nodes is $4 \times 0.537 + 3 \times 0.650 = 4.098$, which is larger than the estimated error for the corresponding node in $T_R$, suggesting that it should not be split.
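The bound in Equation 3.13 is easy to evaluate numerically. The sketch below (not from the text) reproduces the three values used in Example 3.10; it assumes SciPy is available for the standard normal quantile $z_{\alpha/2}$.

from math import sqrt
from scipy.stats import norm

def e_upper(N, e, alpha):
    # Upper bound of the training error rate e at confidence level alpha (Equation 3.13),
    # based on the normal approximation to the binomial distribution.
    z = norm.ppf(1 - alpha / 2)          # z_{alpha/2}, about 1.15 for alpha = 0.25
    num = e + z**2 / (2 * N) + z * sqrt(e * (1 - e) / N + z**2 / (4 * N**2))
    return num / (1 + z**2 / N)

print(e_upper(7, 2/7, 0.25))   # ~0.503, i.e., about 3.52 errors out of 7
print(e_upper(4, 1/4, 0.25))   # ~0.537
print(e_upper(3, 1/3, 0.25))   # ~0.650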

3.5.4 Model Selection for Decision Trees

Building on the generic approaches presented above, we present two commonly used model selection strategies for decision tree induction.

Prepruning (Early Stopping Rule)

In this approach, the tree-growing algorithm is halted before generating a fully grown tree that perfectly fits the entire training data. To do this, a more restrictive stopping condition must be used; e.g., stop expanding a leaf node when the observed gain in the generalization error estimate falls below a certain threshold. This estimate of the generalization error rate can be computed using any of the approaches presented in the preceding three subsections, e.g., by using pessimistic error estimates, by using validation error estimates, or by using statistical bounds. The advantage of prepruning is that it avoids the computations associated with generating overly complex subtrees that overfit the training data. However, one major drawback of this method is that, even if no significant gain is obtained using one of the existing splitting criteria, subsequent splitting may result in better subtrees. Such subtrees would not be reached if prepruning is used, because of the greedy nature of decision tree induction.

Post-pruning

In this approach, the decision tree is initially grown to its maximum size. This is followed by a tree-pruning step, which proceeds to trim the fully grown tree in a bottom-up fashion. Trimming can be done by replacing a subtree with (1) a new leaf node whose class label is determined from the majority class of instances affiliated with the subtree (an approach known as subtree replacement), or (2) the most frequently used branch of the subtree (an approach known as subtree raising). The tree-pruning step terminates when no further improvement in the generalization error estimate is observed beyond a certain threshold. Again, the estimates of the generalization error rate can be computed using any of the approaches presented in the previous three subsections. Post-pruning tends to give better results than prepruning because it makes pruning decisions based on a fully grown tree, unlike prepruning, which can suffer from premature termination of the tree-growing process. However, for post-pruning, the additional computations needed to grow the full tree may be wasted when the subtree is pruned.

Figure 3.32 illustrates the simplified decision tree model for the web robot detection example given in Section 3.3.5. Notice that the subtree rooted at breadth <= 7 has been replaced by one of its branches corresponding to depth = 1, width > 3, and MultiP = 1, using subtree raising. On the other hand, the subtree corresponding to depth > 1 and MultiAgent = 0 has been replaced by a leaf node assigned to class 0, using subtree replacement. The subtree for depth > 1 and MultiAgent = 1 remains intact.

Figure 3.32. Post-pruning of the decision tree for web robot detection.

3.6 Model Evaluation

The previous section discussed several approaches for model selection that can be used to learn a classification model from a training set D.train. Here we discuss methods for estimating its generalization performance, i.e., its performance on unseen instances outside of D.train. This process is known as model evaluation.

Note that the model selection approaches discussed in Section 3.5 also compute an estimate of the generalization performance using the training set D.train. However, these estimates are biased indicators of the performance on unseen instances, since they were used to guide the selection of the classification model. For example, if we use the validation error rate for model selection (as described in Section 3.5.1), the resulting model would be deliberately chosen to minimize the errors on the validation set. The validation error rate may thus underestimate the true generalization error rate, and hence cannot be reliably used for model evaluation.

A correct approach for model evaluation would be to assess the performance of a learned model on a labeled test set that has not been used at any stage of model selection. This can be achieved by partitioning the entire set of labeled instances D into two disjoint subsets: D.train, which is used for model selection, and D.test, which is used for computing the test error rate, $err_{test}$. In the following, we present two different approaches for partitioning D into D.train and D.test, and computing the test error rate, $err_{test}$.

3.6.1 Holdout Method

The most basic technique for partitioning a labeled data set is the holdout method, where the labeled set D is randomly partitioned into two disjoint sets, called the training set D.train and the test set D.test. A classification model is then induced from D.train using the model selection approaches presented in Section 3.5, and its error rate on D.test, $err_{test}$, is used as an estimate of the generalization error rate. The proportion of data reserved for training and for testing is typically at the discretion of the analysts, e.g., two-thirds for training and one-third for testing.

Similar to the trade-off faced while partitioning D.train into D.tr and D.val in Section 3.5.1, choosing the right fraction of labeled data to be used for training and testing is not trivial. If the size of D.train is small, the learned classification model may be improperly learned using an insufficient number of training instances, resulting in a biased estimation of generalization performance. On the other hand, if the size of D.test is small, $err_{test}$ may be less reliable as it would be computed over a small number of test instances. Moreover, $err_{test}$ can have a high variance as we change the random partitioning of D into D.train and D.test.

The holdout method can be repeated several times to obtain a distribution of the test error rates, an approach known as random subsampling or the repeated holdout method. This method produces a distribution of the error rates that can be used to understand the variance of $err_{test}$.

3.6.2 Cross-Validation

Cross-validation is a widely-used model evaluation method that aims to make effective use of all labeled instances in D for both training and testing. To illustrate this method, suppose that we are given a labeled set that we have

randomly partitioned into three equal-sized subsets, $S_1$, $S_2$, and $S_3$, as shown in Figure 3.33. For the first run, we train a model using subsets $S_2$ and $S_3$ (shown as empty blocks) and test the model on subset $S_1$. The test error rate on $S_1$, denoted as $err(S_1)$, is thus computed in the first run. Similarly, for the second run, we use $S_1$ and $S_3$ as the training set and $S_2$ as the test set, to compute the test error rate, $err(S_2)$, on $S_2$. Finally, we use $S_1$ and $S_2$ for training in the third run, while $S_3$ is used for testing, thus resulting in the test error rate $err(S_3)$ for $S_3$. The overall test error rate is obtained by summing up the number of errors committed in each test subset across all runs and dividing it by the total number of instances. This approach is called three-fold cross-validation.

Figure 3.33. Example demonstrating the technique of 3-fold cross-validation.

The k-fold cross-validation method generalizes this approach by segmenting the labeled data D (of size N) into k equal-sized partitions (or folds). During the i-th run, one of the partitions of D is chosen as D.test(i) for testing, while the rest of the partitions are used as D.train(i) for training. A model m(i) is learned using D.train(i) and applied on D.test(i) to obtain the sum of test errors, $err_{sum}(i)$. This procedure is repeated k times. The total test error rate, $err_{test}$, is then computed as

$$err_{test} = \frac{\sum_{i=1}^{k} err_{sum}(i)}{N}. \qquad (3.14)$$

Every instance in the data is thus used for testing exactly once and for training exactly (k − 1) times. Also, every run uses (k − 1)/k fraction of the data for training and 1/k fraction for testing.

The right choice of k in k-fold cross-validation depends on a number of characteristics of the problem. A small value of k will result in a smaller training set at every run, which will result in a larger estimate of the generalization error rate than what is expected of a model trained over the entire labeled set. On the other hand, a high value of k results in a larger training set at every run, which reduces the bias in the estimate of the generalization error rate. In the extreme case, when k = N, every run uses exactly one data instance for testing and the remainder of the data for training. This special case of k-fold cross-validation is called the leave-one-out approach. This approach has the advantage of utilizing as much data as possible for training. However, leave-one-out can produce quite misleading results in some special scenarios, as illustrated in Exercise 11. Furthermore, leave-one-out can be computationally expensive for large data sets as the cross-validation procedure needs to be repeated N times. For most practical applications, choosing k between 5 and 10 provides a reasonable approach for estimating the generalization error rate, because each fold is able to make use of 80% to 90% of the labeled data for training.

The k-fold cross-validation method, as described above, produces a single estimate of the generalization error rate, without providing any information about the variance of the estimate. To obtain this information, we can run k-fold cross-validation for every possible partitioning of the data into k partitions,

and obtain a distribution of test error rates computed for every such partitioning. The average test error rate across all possible partitionings serves as a more robust estimate of the generalization error rate. This approach of estimating the generalization error rate and its variance is known as the complete cross-validation approach. Even though such an estimate is quite robust, it is usually too expensive to consider all possible partitionings of a large data set into k partitions. A more practical solution is to repeat the cross-validation approach multiple times, using a different random partitioning of the data into k partitions every time, and use the average test error rate as the estimate of the generalization error rate. Note that since there is only one possible partitioning for the leave-one-out approach, it is not possible to estimate the variance of the generalization error rate, which is another limitation of this method.

The k-fold cross-validation method does not guarantee that the fraction of positive and negative instances in every partition of the data is equal to the fraction observed in the overall data. A simple solution to this problem is to perform a stratified sampling of the positive and negative instances into k partitions, an approach called stratified cross-validation.
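As an illustrative sketch (not from the text), the following Python code computes $err_{test}$ by k-fold cross-validation as in Equation 3.14, using scikit-learn decision trees on a synthetic data set; all concrete choices here are assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

k = 10
total_errors = 0
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    m = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    total_errors += (m.predict(X[test_idx]) != y[test_idx]).sum()   # err_sum(i)
err_test = total_errors / len(y)                                    # Equation 3.14
print(err_test)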

In k-fold cross-validation, a different model is learned at every run, and the performance of every one of the k models on their respective test folds is then aggregated to compute the overall test error rate, $err_{test}$. Hence, $err_{test}$ does not reflect the generalization error rate of any of the k models. Instead, it reflects the expected generalization error rate of the model selection approach, when applied on a training set of the same size as one of the training folds (N(k − 1)/k). This is different than the $err_{test}$ computed in the holdout method, which exactly corresponds to the specific model learned over D.train. Hence, although effectively utilizing every data instance in D for training and testing, the $err_{test}$ computed in the cross-validation method does not represent the performance of a single model learned over a specific D.train.

Nonetheless, in practice, $err_{test}$ is typically used as an estimate of the generalization error of a model built on D. One motivation for this is that when the size of the training folds is closer to the size of the overall data (when k is large), then $err_{test}$ resembles the expected performance of a model learned over a data set of the same size as D. For example, when k is 10, every training fold is 90% of the overall data. The $err_{test}$ then should approach the expected performance of a model learned over 90% of the overall data, which will be close to the expected performance of a model learned over D.

3.7 Presence of Hyper-parameters

Hyper-parameters are parameters of learning algorithms that need to be determined before learning the classification model. For instance, consider the hyper-parameter $\alpha$ that appeared in Equation 3.11, which is repeated here for convenience:

$$gen.error(m) = train.error(m, D.train) + \alpha \times complexity(\mathcal{M}).$$

This equation was used for estimating the generalization error for a model selection approach that used an explicit representation of model complexity. (See Section 3.5.2.)

For other examples of hyper-parameters, see Chapter 4.

Unlike regular model parameters, such as the test conditions in the internal nodes of a decision tree, hyper-parameters such as $\alpha$ do not appear in the final classification model that is used to classify unlabeled instances. However, the values of hyper-parameters need to be determined during model selection (a process known as hyper-parameter selection) and must be taken into account during model evaluation. Fortunately, both tasks can be effectively accomplished via slight modifications of the cross-validation approach described in the previous section.

3.7.1 Hyper-parameter Selection

In Section 3.5.2, a validation set was used to select $\alpha$, and this approach is generally applicable for hyper-parameter selection. Let p be the hyper-parameter that needs to be selected from a finite range of values, $P = \{p_1, p_2, \ldots, p_n\}$. Partition D.train into D.tr and D.val. For every choice of hyper-parameter value $p_i$, we can learn a model $m_i$ on D.tr, and apply this model on D.val to obtain the validation error rate $err_{val}(p_i)$. Let $p^*$ be the hyper-parameter value that provides the lowest validation error rate. We can then use the model $m^*$ corresponding to $p^*$ as the final choice of classification model.

The above approach, although useful, uses only a subset of the data, D.train, for training and a subset, D.val, for validation. The framework of cross-validation, presented in Section 3.6.2, addresses both of those issues, albeit in the context of model evaluation. Here we indicate how to use a cross-validation approach for hyper-parameter selection. To illustrate this approach, let us partition D.train into three folds as shown in Figure 3.34. At every run, one of the folds is used as D.val for validation, and the remaining two folds are used as D.tr for learning a model, for every choice of hyper-parameter value $p_i$. The overall validation error rate corresponding to each $p_i$ is computed by summing the errors across all three folds. We then select the hyper-parameter value $p^*$ that provides the lowest validation error rate, and use it to learn a model $m^*$ on the entire training set D.train.

Figure 3.34. Example demonstrating the 3-fold cross-validation framework for hyper-parameter selection using D.train.

Algorithm 3.2 generalizes the above approach using a k-fold cross-validation framework for hyper-parameter selection. At the i-th run of cross-validation, the data in the i-th fold is used as D.val(i) for validation (Step 4), while the remainder of the data in D.train is used as D.tr(i) for training (Step 5). Then, for every choice of hyper-parameter value $p_i$, a model is learned on D.tr(i) (Step 7), which is applied on D.val(i) to compute its validation error (Step 8). This is used to compute the validation error rate corresponding to models learned using $p_i$ over all the folds (Step 11). The hyper-parameter value $p^*$ that provides the lowest validation error rate (Step 12) is now used to learn the final model $m^*$ on the entire training set D.train (Step 13). Hence, at the end of this algorithm, we obtain the best choice of the hyper-parameter value as well as the final classification model (Step 14), both of which are obtained by making effective use of every data instance in D.train.

Algorithm 3.2 Procedure model-select(k, P, D.train)
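The steps of the procedure itself are not reproduced here; as a rough stand-in (an illustration, not the book's pseudocode), the following Python sketch implements the same idea, where learn(X, y, p) and error_rate(m, X, y) are assumed helper functions, not defined in the text, that fit a model for hyper-parameter value p and compute its error rate.

import numpy as np

def model_select(k, P, X, y, learn, error_rate, seed=0):
    # Sketch of hyper-parameter selection with k-fold cross-validation over D.train.
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    err_val = {p: 0.0 for p in P}
    for fold in folds:                          # the i-th fold plays the role of D.val(i)
        tr = np.setdiff1d(np.arange(n), fold)   # the remaining folds form D.tr(i)
        for p in P:
            m = learn(X[tr], y[tr], p)
            err_val[p] += error_rate(m, X[fold], y[fold]) * len(fold)
    for p in P:
        err_val[p] /= n                         # validation error rate of p over all folds
    p_star = min(err_val, key=err_val.get)      # value with the lowest validation error rate
    m_star = learn(X, y, p_star)                # final model learned on all of D.train
    return p_star, m_star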

3.7.2 Nested Cross-Validation

The approach of the previous section provides a way to effectively use all the instances in D.train to learn a classification model when hyper-parameter selection is required. This approach can be applied over the entire data set D to learn the final classification model. However, applying Algorithm 3.2 on D would only return the final classification model $m^*$ but not an estimate of its generalization performance, $err_{test}$. Recall that the validation error rates used in Algorithm 3.2 cannot be used as estimates of generalization performance, since they are used to guide the selection of the final model $m^*$. However, to compute $err_{test}$, we can again use a cross-validation framework for evaluating the performance on the entire data set D, as described originally in Section 3.6.2. In this approach, D is partitioned into D.train (for training) and D.test (for testing) at every run of cross-validation. When hyper-parameters are involved, we can use Algorithm 3.2 to train a model using D.train at every run, thus “internally” using cross-validation for model selection. This approach is called nested cross-validation or double cross-validation. Algorithm 3.3 describes the complete approach for estimating $err_{test}$ using nested cross-validation in the presence of hyper-parameters.

As an illustration of this approach, see Figure 3.35, where the labeled set D is partitioned into D.train and D.test using a 3-fold cross-validation method.

Figure 3.35. Example demonstrating 3-fold nested cross-validation for computing $err_{test}$.

At the i-th run of this method, one of the folds is used as the test set, D.test(i), while the remaining two folds are used as the training set, D.train(i). This is represented in Figure 3.35 as the i-th “outer” run. In order to select a model using D.train(i), we again use an “inner” 3-fold cross-validation framework that partitions D.train(i) into D.tr and D.val at every one of the three inner runs (iterations). As described in Section 3.7, we can use the inner cross-validation framework to select the best hyper-parameter value $p^*(i)$ as well as its resulting classification model $m^*(i)$ learned over D.train(i). We can then apply $m^*(i)$ on D.test(i) to obtain the test error at the i-th outer run. By repeating this process for every outer run, we can compute the average test error rate, $err_{test}$, over the entire labeled set D. Note that in the above approach, the inner cross-validation framework is being used for model selection while the outer cross-validation framework is being used for model evaluation.

Algorithm 3.3 The nested cross-validation approach for computing $err_{test}$.
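Again, the listing itself is not reproduced here; the sketch below (an illustration, not the book's pseudocode) shows how an outer cross-validation loop for computing $err_{test}$ can wrap the model_select sketch given earlier, with learn and error_rate the same assumed helpers.

import numpy as np

def nested_cross_validation(k_outer, k_inner, P, X, y, learn, error_rate, seed=0):
    # Outer loop: estimate err_test. Inner loop (inside model_select): choose p* and m*.
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k_outer)
    total_errors = 0.0
    for fold in folds:                               # i-th outer run; fold acts as D.test(i)
        tr = np.setdiff1d(np.arange(n), fold)        # D.train(i)
        _, m_star = model_select(k_inner, P, X[tr], y[tr], learn, error_rate)
        total_errors += error_rate(m_star, X[fold], y[fold]) * len(fold)
    return total_errors / n                          # err_test over the entire labeled set D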

3.8 Pitfalls of Model Selection and Evaluation

Model selection and evaluation, when used effectively, serve as excellent tools for learning classification models and assessing their generalization performance. However, when using them in practical settings, there are several pitfalls that can result in improper and often misleading conclusions. Some of these pitfalls are simple to understand and easy to avoid, while others are quite subtle in nature and difficult to catch. In the following, we present two of these pitfalls and discuss best practices to avoid them.

3.8.1 Overlap between Training and Test Sets

One of the basic requirements of a clean model selection and evaluation setup is that the data used for model selection (D.train) must be kept separate from the data used for model evaluation (D.test). If there is any overlap between the two, the test error rate $err_{test}$ computed over D.test cannot be considered representative of the performance on unseen instances. Comparing the effectiveness of classification models using $err_{test}$ can then be quite misleading, as an overly complex model can show an inaccurately low value of $err_{test}$ due to model overfitting (see Exercise 12 at the end of this chapter).

To illustrate the importance of ensuring no overlap between D.train and D.test, consider a labeled data set where all the attributes are irrelevant, i.e., they have no relationship with the class labels. Using such attributes, we should expect no classification model to perform better than random guessing. However, if the test set involves even a small number of data instances that were used for training, there is a possibility for an overly complex model to show better performance than random, even though the attributes are completely irrelevant. As we will see later in Chapter 10, this scenario can actually be used as a criterion to detect overfitting due to an improper experimental setup. If a model shows better performance than a random classifier even when the attributes are irrelevant, it is an indication of a potential feedback between the training and test sets.

3.8.2 Use of Validation Error as Generalization Error

The validation error rate $err_{val}$ serves an important role during model selection, as it provides “out-of-sample” error estimates of models on D.val, which is not used for training the models. Hence, $err_{val}$ serves as a better metric than the training error rate for selecting models and hyper-parameter values, as described in Sections 3.5.1 and 3.7, respectively. However, once the validation set has been used for selecting a classification model $m^*$, $err_{val}$ no longer reflects the performance of $m^*$ on unseen instances.

To realize the pitfall in using the validation error rate as an estimate of generalization performance, consider the problem of selecting a hyper-parameter value p from a range of values P, using a validation set D.val. If the number of possible values in P is quite large and the size of D.val is small, it is possible to select a hyper-parameter value $p^*$ that shows favorable performance on D.val just by random chance. Notice the similarity of this problem with the multiple comparisons problem discussed in Section 3.4.1. Even though the classification model $m^*$ learned using $p^*$ would show a low validation error rate, it would lack generalizability on unseen test instances.

The correct approach for estimating the generalization error rate of a model $m^*$ is to use an independently chosen test set D.test that hasn't been used in any way to influence the selection of $m^*$. As a rule of thumb, the test set should never be examined during model selection, to ensure the absence of any form of overfitting. If the insights gained from any portion of a labeled data set help in improving the classification model even in an indirect way, then that portion of data must be discarded during testing.

3.9 Model Comparison

One difficulty when comparing the performance of different classification models is whether the observed difference in their performance is statistically significant. For example, consider a pair of classification models, $M_A$ and $M_B$. Suppose $M_A$ achieves 85% accuracy when evaluated on a test set containing 30 instances, while $M_B$ achieves 75% accuracy on a different test set containing 5000 instances. Based on this information, is $M_A$ a better model than $M_B$? This example raises two key questions regarding the statistical significance of a performance metric:

1. Although $M_A$ has a higher accuracy than $M_B$, it was tested on a smaller test set. How much confidence do we have that the accuracy for $M_A$ is actually 85%?

2. Is it possible to explain the difference in accuracies between $M_A$ and $M_B$ as a result of variations in the composition of their test sets?

The first question relates to the issue of estimating the confidence interval of model accuracy. The second question relates to the issue of testing the statistical significance of the observed deviation. These issues are investigated in the remainder of this section.

3.9.1 Estimating the Confidence Interval for Accuracy

To determine its confidence interval, we need to establish the probability distribution for sample accuracy. This section describes an approach for deriving the confidence interval by modeling the classification task as a binomial random experiment. The following describes the characteristics of such an experiment:

1. The random experiment consists of N independent trials, where each trial has two possible outcomes: success or failure.

2. The probability of success, p, in each trial is constant.

An example of a binomial experiment is counting the number of heads that turn up when a coin is flipped N times. If X is the number of successes observed in N trials, then the probability that X takes a particular value is given by a binomial distribution with mean Np and variance Np(1 − p):

$$P(X = v) = \binom{N}{v} p^{v}(1-p)^{N-v}.$$

For example, if the coin is fair (p = 0.5) and is flipped fifty times, then the probability that the head shows up 20 times is

$$P(X = 20) = \binom{50}{20} 0.5^{20}(1 - 0.5)^{30} = 0.0419.$$

If the experiment is repeated many times, then the average number of heads expected to show up is 50 × 0.5 = 25, while its variance is 50 × 0.5 × 0.5 = 12.5.

The task of predicting the class labels of test instances can also be considered as a binomial experiment. Given a test set that contains N instances, let X be the number of instances correctly predicted by a model and p be the true accuracy of the model. If the prediction task is modeled as a binomial experiment, then X has a binomial distribution with mean Np and variance Np(1 − p). It can be shown that the empirical accuracy, acc = X/N, also has a binomial distribution with mean p and variance p(1 − p)/N (see Exercise 14). The binomial distribution can be approximated by a normal distribution when N is sufficiently large. Based on the normal distribution, the confidence interval for acc can be derived as follows:

$$P\left(-Z_{\alpha/2} \le \frac{acc - p}{\sqrt{p(1-p)/N}} \le Z_{1-\alpha/2}\right) = 1 - \alpha, \qquad (3.15)$$

where $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ are the upper and lower bounds obtained from a standard normal distribution at confidence level $(1 - \alpha)$. Since a standard normal distribution is symmetric around Z = 0, it follows that $Z_{\alpha/2} = Z_{1-\alpha/2}$. Rearranging this inequality leads to the following confidence interval for p:

$$\frac{2 \times N \times acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4 N acc - 4 N acc^2}}{2(N + Z_{\alpha/2}^2)}. \qquad (3.16)$$

The following table shows the values of $Z_{\alpha/2}$ at different confidence levels $(1 - \alpha)$:

1 − α:      0.99   0.98   0.95   0.9    0.8    0.7    0.5
Z_{α/2}:    2.58   2.33   1.96   1.65   1.28   1.04   0.67

Example 3.11. Confidence Interval for Accuracy. Consider a model that has an accuracy of 80% when evaluated on 100 test instances. What is the confidence interval for its true accuracy at a 95% confidence level? The confidence level of 95% corresponds to $Z_{\alpha/2} = 1.96$ according to the table given above. Inserting this term into Equation 3.16 yields a confidence interval between 71.1% and 86.7%. The following table shows the confidence interval when the number of instances, N, increases:

N:                     20            50            100           500           1000          5000
Confidence Interval:   0.584 − 0.919  0.670 − 0.888  0.711 − 0.867  0.763 − 0.833  0.774 − 0.824  0.789 − 0.811

Note that the confidence interval becomes tighter when N increases.
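The interval in Equation 3.16 is straightforward to compute; the Python sketch below (not from the text) reproduces the table above.

from math import sqrt

def accuracy_confidence_interval(acc, N, z):
    # Confidence interval for the true accuracy p (Equation 3.16);
    # z is Z_{alpha/2}, e.g., 1.96 at the 95% confidence level.
    center = 2 * N * acc + z**2
    spread = z * sqrt(z**2 + 4 * N * acc - 4 * N * acc**2)
    denom = 2 * (N + z**2)
    return (center - spread) / denom, (center + spread) / denom

for N in (20, 50, 100, 500, 1000, 5000):
    lo, hi = accuracy_confidence_interval(acc=0.8, N=N, z=1.96)
    print(N, round(lo, 3), round(hi, 3))   # reproduces the table for acc = 80%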

3.9.2 Comparing the Performance of Two Models

Consider a pair of models, $M_1$ and $M_2$, which are evaluated on two independent test sets, $D_1$ and $D_2$. Let $n_1$ denote the number of instances in $D_1$ and $n_2$ denote the number of instances in $D_2$. In addition, suppose the error rate for $M_1$ on $D_1$ is $e_1$ and the error rate for $M_2$ on $D_2$ is $e_2$. Our goal is to test whether the observed difference between $e_1$ and $e_2$ is statistically significant.

Assuming that $n_1$ and $n_2$ are sufficiently large, the error rates $e_1$ and $e_2$ can be approximated using normal distributions. If the observed difference in the error rate is denoted as $d = e_1 - e_2$, then d is also normally distributed with mean $d_t$, its true difference, and variance $\sigma_d^2$. The variance of d can be computed as follows:

$$\sigma_d^2 \simeq \hat{\sigma}_d^2 = \frac{e_1(1 - e_1)}{n_1} + \frac{e_2(1 - e_2)}{n_2}, \qquad (3.17)$$

where $e_1(1 - e_1)/n_1$ and $e_2(1 - e_2)/n_2$ are the variances of the error rates. Finally, at the $(1 - \alpha)\%$ confidence level, it can be shown that the confidence interval for the true difference $d_t$ is given by the following equation:

$$d_t = d \pm z_{\alpha/2}\,\hat{\sigma}_d. \qquad (3.18)$$

Example 3.12. Significance Testing. Consider the problem described at the beginning of this section. Model $M_A$ has an error rate of $e_1 = 0.15$ when applied to $N_1 = 30$ test instances, while model $M_B$ has an error rate of $e_2 = 0.25$ when applied to $N_2 = 5000$ test instances. The observed difference in their error rates is $d = |0.15 - 0.25| = 0.1$. In this example, we are performing a two-sided test to check whether $d_t = 0$ or $d_t \neq 0$. The estimated variance of the observed difference in error rates can be computed as follows:

$$\hat{\sigma}_d^2 = \frac{0.15(1 - 0.15)}{30} + \frac{0.25(1 - 0.25)}{5000} = 0.0043,$$

or $\hat{\sigma}_d = 0.0655$. Inserting this value into Equation 3.18, we obtain the following confidence interval for $d_t$ at the 95% confidence level:

$$d_t = 0.1 \pm 1.96 \times 0.0655 = 0.1 \pm 0.128.$$

As the interval spans the value zero, we can conclude that the observed difference is not statistically significant at a 95% confidence level.

At what confidence level can we reject the hypothesis that $d_t = 0$? To do this, we need to determine the value of $Z_{\alpha/2}$ such that the confidence interval for $d_t$ does not span the value zero. We can reverse the preceding computation and look for the value $Z_{\alpha/2}$ such that $d > Z_{\alpha/2}\hat{\sigma}_d$. Replacing the values of d and $\hat{\sigma}_d$ gives $Z_{\alpha/2} < 1.527$. This value first occurs when $(1 - \alpha) \lesssim 0.936$ (for a two-sided test). The result suggests that the null hypothesis can be rejected at a confidence level of 93.6% or lower.
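For completeness, the arithmetic of Example 3.12 can be reproduced with a few lines of Python (added here as an illustration):

from math import sqrt

e1, n1 = 0.15, 30          # error rate of M_A and its test set size
e2, n2 = 0.25, 5000        # error rate of M_B and its test set size

d = abs(e1 - e2)                                          # observed difference
sigma_d = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)   # Equation 3.17
print(d - 1.96 * sigma_d, d + 1.96 * sigma_d)             # 95% interval spans zero
print(d / sigma_d)                                        # ~1.527, largest z that rejects d_t = 0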

3.10BibliographicNotesEarlyclassificationsystemsweredevelopedtoorganizevariouscollectionsofobjects,fromlivingorganismstoinanimateones.Examplesabound,fromAristotle'scataloguingofspeciestotheDeweyDecimalandLibraryofCongressclassificationsystemsforbooks.Suchatasktypicallyrequiresconsiderablehumanefforts,bothtoidentifypropertiesoftheobjectstobeclassifiedandtoorganizethemintowelldistinguishedcategories.

Withthedevelopmentofstatisticsandcomputing,automatedclassificationhasbeenasubjectofintensiveresearch.Thestudyofclassificationinclassicalstatisticsissometimesknownasdiscriminantanalysis,wheretheobjectiveistopredictthegroupmembershipofanobjectbasedonitscorrespondingfeatures.Awell-knownclassicalmethodisFisher'slineardiscriminantanalysis[142],whichseekstofindalinearprojectionofthedatathatproducesthebestseparationbetweenobjectsfromdifferentclasses.

Manypatternrecognitionproblemsalsorequirethediscriminationofobjectsfromdifferentclasses.Examplesincludespeechrecognition,handwrittencharacteridentification,andimageclassification.ReaderswhoareinterestedintheapplicationofclassificationtechniquesforpatternrecognitionmayrefertothesurveyarticlesbyJainetal.[150]andKulkarnietal.[157]orclassicpatternrecognitionbooksbyBishop[125],Dudaetal.[137],andFukunaga[143].Thesubjectofclassificationisalsoamajorresearchtopicinneuralnetworks,statisticallearning,andmachinelearning.Anin-depthtreatmentonthetopicofclassificationfromthestatisticalandmachinelearningperspectivescanbefoundinthebooksbyBishop[126],CherkasskyandMulier[132],Hastieetal.[148],Michieetal.[162],Murphy[167],andMitchell[165].Recentyearshavealsoseenthereleaseofmanypubliclyavailable

softwarepackagesforclassification,whichcanbeembeddedinprogramminglanguagessuchasJava(Weka[147])andPython(scikit-learn[174]).

An overview of decision tree induction algorithms can be found in the survey articles by Buntine [129], Moret [166], Murthy [168], and Safavian et al. [179]. Examples of some well-known decision tree algorithms include CART [127], ID3 [175], C4.5 [177], and CHAID [153]. Both ID3 and C4.5 employ the entropy measure as their splitting function. An in-depth discussion of the C4.5 decision tree algorithm is given by Quinlan [177]. The CART algorithm was developed by Breiman et al. [127] and uses the Gini index as its splitting function. CHAID [153] uses the statistical $\chi^2$ test to determine the best split during the tree-growing process.

The decision tree algorithm presented in this chapter assumes that the splitting condition at each internal node contains only one attribute. An oblique decision tree can use multiple attributes to form the attribute test condition in a single node [149, 187]. Breiman et al. [127] provide an option for using linear combinations of attributes in their CART implementation. Other approaches for inducing oblique decision trees were proposed by Heath et al. [149], Murthy et al. [169], Cantú-Paz and Kamath [130], and Utgoff and Brodley [187]. Although an oblique decision tree helps to improve the expressiveness of the model representation, the tree induction process becomes computationally challenging. Another way to improve the expressiveness of a decision tree without using oblique decision trees is to apply a method known as constructive induction [161]. This method simplifies the task of learning complex splitting functions by creating compound features from the original data.

Besides the top-down approach, other strategies for growing a decision tree include the bottom-up approach by Landeweerd et al. [159] and Pattipati and Alexandridis [173], as well as the bidirectional approach by Kim and Landgrebe [154]. Schuermann and Doster [181] and Wang and Suen [193] proposed using a soft splitting criterion to address the data fragmentation problem. In this approach, each instance is assigned to different branches of the decision tree with different probabilities.

Modeloverfittingisanimportantissuethatmustbeaddressedtoensurethatadecisiontreeclassifierperformsequallywellonpreviouslyunlabeleddatainstances.ThemodeloverfittingproblemhasbeeninvestigatedbymanyauthorsincludingBreimanetal.[127],Schaffer[180],Mingers[164],andJensenandCohen[151].Whilethepresenceofnoiseisoftenregardedasoneoftheprimaryreasonsforoverfitting[164,170],JensenandCohen[151]viewedoverfittingasanartifactoffailuretocompensateforthemultiplecomparisonsproblem.

Bishop[126]andHastieetal.[148]provideanexcellentdiscussionofmodeloverfitting,relatingittoawell-knownframeworkoftheoreticalanalysis,knownasbias-variancedecomposition[146].Inthisframework,thepredictionofalearningalgorithmisconsideredtobeafunctionofthetrainingset,whichvariesasthetrainingsetischanged.Thegeneralizationerrorofamodelisthendescribedintermsofitsbias(theerroroftheaveragepredictionobtainedusingdifferenttrainingsets),itsvariance(howdifferentarethepredictionsobtainedusingdifferenttrainingsets),andnoise(theirreducibleerrorinherenttotheproblem).Anunderfitmodelisconsideredtohavehighbiasbutlowvariance,whileanoverfitmodelisconsideredtohavelowbiasbuthighvariance.Althoughthebias-variancedecompositionwasoriginallyproposedforregressionproblems(wherethetargetattributeisacontinuousvariable),aunifiedanalysisthatisapplicableforclassificationhasbeenproposedbyDomingos[136].ThebiasvariancedecompositionwillbediscussedinmoredetailwhileintroducingensemblelearningmethodsinChapter4 .

Variouslearningprinciples,suchastheProbablyApproximatelyCorrect(PAC)learningframework[188],havebeendevelopedtoprovideatheoreticalframeworkforexplainingthegeneralizationperformanceoflearningalgorithms.Inthefieldofstatistics,anumberofperformanceestimationmethodshavebeenproposedthatmakeatrade-offbetweenthegoodnessoffitofamodelandthemodelcomplexity.MostnoteworthyamongthemaretheAkaike'sInformationCriterion[120]andtheBayesianInformationCriterion[182].Theybothapplycorrectivetermstothetrainingerrorrateofamodel,soastopenalizemorecomplexmodels.Anotherwidely-usedapproachformeasuringthecomplexityofanygeneralmodelistheVapnikChervonenkis(VC)Dimension[190].TheVCdimensionofaclassoffunctionsCisdefinedasthemaximumnumberofpointsthatcanbeshattered(everypointcanbedistinguishedfromtherest)byfunctionsbelongingtoC,foranypossibleconfigurationofpoints.TheVCdimensionlaysthefoundationofthestructuralriskminimizationprinciple[189],whichisextensivelyusedinmanylearningalgorithms,e.g.,supportvectormachines,whichwillbediscussedindetailinChapter4 .

TheOccam'srazorprincipleisoftenattributedtothephilosopherWilliamofOccam.Domingos[135]cautionedagainstthepitfallofmisinterpretingOccam'srazorascomparingmodelswithsimilartrainingerrors,insteadofgeneralizationerrors.Asurveyondecisiontree-pruningmethodstoavoidoverfittingisgivenbyBreslowandAha[128]andEspositoetal.[141].Someofthetypicalpruningmethodsincludereducederrorpruning[176],pessimisticerrorpruning[176],minimumerrorpruning[171],criticalvaluepruning[163],cost-complexitypruning[127],anderror-basedpruning[177].QuinlanandRivestproposedusingtheminimumdescriptionlengthprinciplefordecisiontreepruningin[178].

Thediscussionsinthischapteronthesignificanceofcross-validationerrorestimatesisinspiredfromChapter7 inHastieetal.[148].Itisalsoan

excellentresourceforunderstanding“therightandwrongwaystodocross-validation”,whichissimilartothediscussiononpitfallsinSection3.8 ofthischapter.Acomprehensivediscussionofsomeofthecommonpitfallsinusingcross-validationformodelselectionandevaluationisprovidedinKrstajicetal.[156].

Theoriginalcross-validationmethodwasproposedindependentlybyAllen[121],Stone[184],andGeisser[145]formodelassessment(evaluation).Eventhoughcross-validationcanbeusedformodelselection[194],itsusageformodelselectionisquitedifferentthanwhenitisusedformodelevaluation,asemphasizedbyStone[184].Overtheyears,thedistinctionbetweenthetwousageshasoftenbeenignored,resultinginincorrectfindings.Oneofthecommonmistakeswhileusingcross-validationistoperformpre-processingoperations(e.g.,hyper-parametertuningorfeatureselection)usingtheentiredatasetandnot“within”thetrainingfoldofeverycross-validationrun.Ambroiseetal.,usinganumberofgeneexpressionstudiesasexamples,[124]provideanextensivediscussionoftheselectionbiasthatariseswhenfeatureselectionisperformedoutsidecross-validation.UsefulguidelinesforevaluatingmodelsonmicroarraydatahavealsobeenprovidedbyAllisonetal.[122].

Theuseofthecross-validationprotocolforhyper-parametertuninghasbeendescribedindetailbyDudoitandvanderLaan[138].Thisapproachhasbeencalled“grid-searchcross-validation.”Thecorrectapproachinusingcross-validationforbothhyper-parameterselectionandmodelevaluation,asdiscussedinSection3.7 ofthischapter,isextensivelycoveredbyVarmaandSimon[191].Thiscombinedapproachhasbeenreferredtoas“nestedcross-validation”or“doublecross-validation”intheexistingliterature.Recently,TibshiraniandTibshirani[185]haveproposedanewapproachforhyper-parameterselectionandmodelevaluation.Tsamardinosetal.[186]comparedthisapproachtonestedcross-validation.Theexperimentsthey

performedfoundthat,onaverage,bothapproachesprovideconservativeestimatesofmodelperformancewiththeTibshiraniandTibshiraniapproachbeingmorecomputationallyefficient.

Kohavi[155]hasperformedanextensiveempiricalstudytocomparetheperformancemetricsobtainedusingdifferentestimationmethodssuchasrandomsubsamplingandk-foldcross-validation.Theirresultssuggestthatthebestestimationmethodisten-fold,stratifiedcross-validation.

An alternative approach for model evaluation is the bootstrap method, which was presented by Efron in 1979 [139]. In this method, training instances are sampled with replacement from the labeled set, i.e., an instance previously selected to be part of the training set is equally likely to be drawn again. If the original data has N instances, it can be shown that, on average, a bootstrap sample of size N contains about 63.2% of the instances in the original data. Instances that are not included in the bootstrap sample become part of the test set. The bootstrap procedure for obtaining training and test sets is repeated b times, resulting in a different error rate on the test set, err(i), at the i-th run. To obtain the overall error rate, $err_{boot}$, the .632 bootstrap approach combines err(i) with the error rate obtained from a training set containing all the labeled examples, $err_s$, as follows:

$$err_{boot} = \frac{1}{b}\sum_{i=1}^{b}\left(0.632 \times err(i) + 0.368 \times err_s\right). \qquad (3.19)$$

Efron and Tibshirani [140] provided a theoretical and empirical comparison between cross-validation and a bootstrap method known as the 632+ rule.

While the .632 bootstrap method presented above provides a robust estimate of the generalization performance with low variance in its estimate, it may produce misleading results for highly complex models in certain conditions, as demonstrated by Kohavi [155]. This is because the overall error rate is not truly an out-of-sample error estimate, as it depends on the training error rate, $err_s$, which can be quite small if there is overfitting.

Current techniques such as C4.5 require that the entire training data set fit into main memory. There has been considerable effort to develop parallel and scalable versions of decision tree induction algorithms. Some of the proposed algorithms include SLIQ by Mehta et al. [160], SPRINT by Shafer et al. [183], CMP by Wang and Zaniolo [192], CLOUDS by Alsabti et al. [123], RainForest by Gehrke et al. [144], and ScalParC by Joshi et al. [152]. A survey of parallel algorithms for classification and other data mining tasks is given in [158]. More recently, there has been extensive research to implement large-scale classifiers on the compute unified device architecture (CUDA) [131, 134] and MapReduce [133, 172] platforms.
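As a rough illustration of Equation 3.19 (a sketch with assumed helpers learn(X, y) and error_rate(m, X, y), analogous to those used in earlier sketches, not an implementation from the text):

import numpy as np

def bootstrap_632(X, y, b, learn, error_rate, seed=0):
    # .632 bootstrap estimate: combine the out-of-sample error of each bootstrap
    # run, err(i), with the resubstitution error err_s on the full labeled set.
    rng = np.random.default_rng(seed)
    n = len(y)
    err_s = error_rate(learn(X, y), X, y)
    total = 0.0
    for _ in range(b):
        boot = rng.integers(0, n, size=n)            # sample N instances with replacement
        test = np.setdiff1d(np.arange(n), boot)      # instances left out of the sample
        m = learn(X[boot], y[boot])
        total += 0.632 * error_rate(m, X[test], y[test]) + 0.368 * err_s
    return total / b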

Bibliography[120]H.Akaike.Informationtheoryandanextensionofthemaximum

likelihoodprinciple.InSelectedPapersofHirotuguAkaike,pages199–213.Springer,1998.

[121]D.M.Allen.Therelationshipbetweenvariableselectionanddataagumentationandamethodforprediction.Technometrics,16(1):125–127,1974.

[122]D.B.Allison,X.Cui,G.P.Page,andM.Sabripour.Microarraydataanalysis:fromdisarraytoconsolidationandconsensus.Naturereviewsgenetics,7(1):55–65,2006.

[123]K.Alsabti,S.Ranka,andV.Singh.CLOUDS:ADecisionTreeClassifierforLargeDatasets.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages2–8,NewYork,NY,August1998.

[124]C.AmbroiseandG.J.McLachlan.Selectionbiasingeneextractiononthebasisofmicroarraygene-expressiondata.Proceedingsofthenationalacademyofsciences,99(10):6562–6566,2002.

[125]C.M.Bishop.NeuralNetworksforPatternRecognition.OxfordUniversityPress,Oxford,U.K.,1995.

[126]C.M.Bishop.PatternRecognitionandMachineLearning.Springer,2006.

[127]L.Breiman,J.H.Friedman,R.Olshen,andC.J.Stone.ClassificationandRegressionTrees.Chapman&Hall,NewYork,1984.

[128]L.A.BreslowandD.W.Aha.SimplifyingDecisionTrees:ASurvey.KnowledgeEngineeringReview,12(1):1–40,1997.

[129]W.Buntine.Learningclassificationtrees.InArtificialIntelligenceFrontiersinStatistics,pages182–201.Chapman&Hall,London,1993.

[130]E.Cantú-PazandC.Kamath.Usingevolutionaryalgorithmstoinduceobliquedecisiontrees.InProc.oftheGeneticandEvolutionaryComputationConf.,pages1053–1060,SanFrancisco,CA,2000.

[131]B.Catanzaro,N.Sundaram,andK.Keutzer.Fastsupportvectormachinetrainingandclassificationongraphicsprocessors.InProceedingsofthe25thInternationalConferenceonMachineLearning,pages104–111,2008.

[132]V.CherkasskyandF.M.Mulier.LearningfromData:Concepts,Theory,andMethods.Wiley,2ndedition,2007.

[133]C.Chu,S.K.Kim,Y.-A.Lin,Y.Yu,G.Bradski,A.Y.Ng,andK.Olukotun.Map-reduceformachinelearningonmulticore.Advancesinneuralinformationprocessingsystems,19:281,2007.

[134]A.Cotter,N.Srebro,andJ.Keshet.AGPU-tailoredApproachforTrainingKernelizedSVMs.InProceedingsofthe17thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages805–813,SanDiego,California,USA,2011.

[135]P.Domingos.TheRoleofOccam'sRazorinKnowledgeDiscovery.DataMiningandKnowledgeDiscovery,3(4):409–425,1999.

[136]P.Domingos.Aunifiedbias-variancedecomposition.InProceedingsof17thInternationalConferenceonMachineLearning,pages231–238,2000.

[137]R.O.Duda,P.E.Hart,andD.G.Stork.PatternClassification.JohnWiley&Sons,Inc.,NewYork,2ndedition,2001.

[138]S.DudoitandM.J.vanderLaan.Asymptoticsofcross-validatedriskestimationinestimatorselectionandperformanceassessment.StatisticalMethodology,2(2):131–154,2005.

[139]B.Efron.Bootstrapmethods:anotherlookatthejackknife.InBreakthroughsinStatistics,pages569–593.Springer,1992.

[140]B.EfronandR.Tibshirani.Cross-validationandtheBootstrap:EstimatingtheErrorRateofaPredictionRule.Technicalreport,StanfordUniversity,1995.

[141]F.Esposito,D.Malerba,andG.Semeraro.AComparativeAnalysisofMethodsforPruningDecisionTrees.IEEETrans.PatternAnalysisandMachineIntelligence,19(5):476–491,May1997.

[142]R.A.Fisher.Theuseofmultiplemeasurementsintaxonomicproblems.AnnalsofEugenics,7:179–188,1936.

[143]K.Fukunaga.IntroductiontoStatisticalPatternRecognition.AcademicPress,NewYork,1990.

[144]J.Gehrke,R.Ramakrishnan,andV.Ganti.RainForest—AFrameworkforFastDecisionTreeConstructionofLargeDatasets.DataMiningandKnowledgeDiscovery,4(2/3):127–162,2000.

[145]S.Geisser.Thepredictivesamplereusemethodwithapplications.JournaloftheAmericanStatisticalAssociation,70(350):320–328,1975.

[146]S.Geman,E.Bienenstock,andR.Doursat.Neuralnetworksandthebias/variancedilemma.Neuralcomputation,4(1):1–58,1992.

[147]M.Hall,E.Frank,G.Holmes,B.Pfahringer,P.Reutemann,andI.H.Witten.TheWEKADataMiningSoftware:AnUpdate.SIGKDDExplorations,11(1),2009.

[148]T.Hastie,R.Tibshirani,andJ.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,andPrediction.Springer,2ndedition,2009.

[149]D.Heath,S.Kasif,andS.Salzberg.InductionofObliqueDecisionTrees.InProc.ofthe13thIntl.JointConf.onArtificialIntelligence,pages1002–1007,Chambery,France,August1993.

[150]A.K.Jain,R.P.W.Duin,andJ.Mao.StatisticalPatternRecognition:AReview.IEEETran.Patt.Anal.andMach.Intellig.,22(1):4–37,2000.

[151]D.JensenandP.R.Cohen.MultipleComparisonsinInductionAlgorithms.MachineLearning,38(3):309–338,March2000.

[152]M.V.Joshi,G.Karypis,andV.Kumar.ScalParC:ANewScalableandEfficientParallelClassificationAlgorithmforMiningLargeDatasets.InProc.of12thIntl.ParallelProcessingSymp.(IPPS/SPDP),pages573–579,Orlando,FL,April1998.

[153]G.V.Kass.AnExploratoryTechniqueforInvestigatingLargeQuantitiesofCategoricalData.AppliedStatistics,29:119–127,1980.

[154]B.KimandD.Landgrebe.Hierarchicaldecisionclassifiersinhigh-dimensionalandlargeclassdata.IEEETrans.onGeoscienceandRemoteSensing,29(4):518–528,1991.

[155]R.Kohavi.AStudyonCross-ValidationandBootstrapforAccuracyEstimationandModelSelection.InProc.ofthe15thIntl.JointConf.onArtificialIntelligence,pages1137–1145,Montreal,Canada,August1995.

[156]D.Krstajic,L.J.Buturovic,D.E.Leahy,andS.Thomas.Cross-validationpitfallswhenselectingandassessingregressionandclassificationmodels.Journalofcheminformatics,6(1):1,2014.

[157]S.R.Kulkarni,G.Lugosi,andS.S.Venkatesh.LearningPatternClassification—ASurvey.IEEETran.Inf.Theory,44(6):2178–2206,1998.

[158]V.Kumar,M.V.Joshi,E.-H.Han,P.N.Tan,andM.Steinbach.HighPerformanceDataMining.InHighPerformanceComputingforComputationalScience(VECPAR2002),pages111–125.Springer,2002.

[159]G.Landeweerd,T.Timmers,E.Gersema,M.Bins,andM.Halic.Binarytreeversussingleleveltreeclassificationofwhitebloodcells.PatternRecognition,16:571–577,1983.

[160]M.Mehta,R.Agrawal,andJ.Rissanen.SLIQ:AFastScalableClassifierforDataMining.InProc.ofthe5thIntl.Conf.onExtendingDatabaseTechnology,pages18–32,Avignon,France,March1996.

[161]R.S.Michalski.Atheoryandmethodologyofinductivelearning.ArtificialIntelligence,20:111–116,1983.

[162]D.Michie,D.J.Spiegelhalter,andC.C.Taylor.MachineLearning,NeuralandStatisticalClassification.EllisHorwood,UpperSaddleRiver,NJ,1994.

[163]J.Mingers.ExpertSystems—RuleInductionwithStatisticalData.JOperationalResearchSociety,38:39–47,1987.

[164]J.Mingers.Anempiricalcomparisonofpruningmethodsfordecisiontreeinduction.MachineLearning,4:227–243,1989.

[165]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.

[166]B.M.E.Moret.DecisionTreesandDiagrams.ComputingSurveys,14(4):593–623,1982.

[167]K.P.Murphy.MachineLearning:AProbabilisticPerspective.MITPress,2012.

[168]S.K.Murthy.AutomaticConstructionofDecisionTreesfromData:AMulti-DisciplinarySurvey.DataMiningandKnowledgeDiscovery,2(4):345–389,1998.

[169]S.K.Murthy,S.Kasif,andS.Salzberg.Asystemforinductionofobliquedecisiontrees.JofArtificialIntelligenceResearch,2:1–33,1994.

[170]T.Niblett.Constructingdecisiontreesinnoisydomains.InProc.ofthe2ndEuropeanWorkingSessiononLearning,pages67–78,Bled,Yugoslavia,May1987.

[171]T.NiblettandI.Bratko.LearningDecisionRulesinNoisyDomains.InResearchandDevelopmentinExpertSystemsIII,Cambridge,1986.

CambridgeUniversityPress.

[172]I.PalitandC.K.Reddy.Scalableandparallelboostingwithmapreduce.IEEETransactionsonKnowledgeandDataEngineering,24(10):1904–1916,2012.

[173]K.R.PattipatiandM.G.Alexandridis.Applicationofheuristicsearchandinformationtheorytosequentialfaultdiagnosis.IEEETrans.onSystems,Man,andCybernetics,20(4):872–887,1990.

[174]F.Pedregosa,G.Varoquaux,A.Gramfort,V.Michel,B.Thirion,O.Grisel,M.Blondel,P.Prettenhofer,R.Weiss,V.Dubourg,J.Vanderplas,A.Passos,D.Cournapeau,M.Brucher,M.Perrot,andE.Duchesnay.Scikit-learn:MachineLearninginPython.JournalofMachineLearningResearch,12:2825–2830,2011.

[175]J.R.Quinlan.Discoveringrulesbyinductionfromlargecollectionofexamples.InD.Michie,editor,ExpertSystemsintheMicroElectronicAge.EdinburghUniversityPress,Edinburgh,UK,1979.

[176]J.R.Quinlan.SimplifyingDecisionTrees.Intl.J.Man-MachineStudies,27:221–234,1987.

[177]J.R.Quinlan.C4.5:ProgramsforMachineLearning.Morgan-KaufmannPublishers,SanMateo,CA,1993.

[178]J.R.QuinlanandR.L.Rivest.InferringDecisionTreesUsingtheMinimumDescriptionLengthPrinciple.InformationandComputation,80(3):227–248,1989.

[179]S.R.SafavianandD.Landgrebe.ASurveyofDecisionTreeClassifierMethodology.IEEETrans.Systems,ManandCybernetics,22:660–674,May/June1998.

[180]C.Schaffer.Overfittingavoidenceasbias.MachineLearning,10:153–178,1993.

[181]J.SchuermannandW.Doster.Adecision-theoreticapproachinhierarchicalclassifierdesign.PatternRecognition,17:359–369,1984.

[182]G.Schwarzetal.Estimatingthedimensionofamodel.Theannalsofstatistics,6(2):461–464,1978.

[183]J.C.Shafer,R.Agrawal,andM.Mehta.SPRINT:AScalableParallelClassifierforDataMining.InProc.ofthe22ndVLDBConf.,pages544–555,Bombay,India,September1996.

[184]M.Stone.Cross-validatorychoiceandassessmentofstatisticalpredictions.JournaloftheRoyalStatisticalSociety.SeriesB(Methodological),pages111–147,1974.

[185] R. J. Tibshirani and R. Tibshirani. A bias correction for the minimum error rate in cross-validation. The Annals of Applied Statistics, pages 822–829, 2009.

[186]I.Tsamardinos,A.Rakhshani,andV.Lagani.Performance-estimationpropertiesofcross-validation-basedprotocolswithsimultaneoushyper-parameteroptimization.InHellenicConferenceonArtificialIntelligence,pages1–14.Springer,2014.

[187]P.E.UtgoffandC.E.Brodley.Anincrementalmethodforfindingmultivariatesplitsfordecisiontrees.InProc.ofthe7thIntl.Conf.onMachineLearning,pages58–65,Austin,TX,June1990.

[188]L.Valiant.Atheoryofthelearnable.CommunicationsoftheACM,27(11):1134–1142,1984.

[189]V.N.Vapnik.StatisticalLearningTheory.Wiley-Interscience,1998.

[190]V.N.VapnikandA.Y.Chervonenkis.Ontheuniformconvergenceofrelativefrequenciesofeventstotheirprobabilities.InMeasuresofComplexity,pages11–30.Springer,2015.

[191]S.VarmaandR.Simon.Biasinerrorestimationwhenusingcross-validationformodelselection.BMCbioinformatics,7(1):1,2006.

[192]H.WangandC.Zaniolo.CMP:AFastDecisionTreeClassifierUsingMultivariatePredictions.InProc.ofthe16thIntl.Conf.onDataEngineering,pages449–460,SanDiego,CA,March2000.

[193]Q.R.WangandC.Y.Suen.Largetreeclassifierwithheuristicsearchandglobaltraining.IEEETrans.onPatternAnalysisandMachineIntelligence,9(1):91–102,1987.

[194]Y.ZhangandY.Yang.Cross-validationforselectingamodelselectionprocedure.JournalofEconometrics,187(1):95–112,2015.

3.11 Exercises

1. Draw the full decision tree for the parity function of four Boolean attributes, A, B, C, and D. Is it possible to simplify the tree?

2. Consider the training examples shown in Table 3.5 for a binary classification problem.

Table 3.5. Data set for Exercise 2.

Customer ID  Gender  Car Type  Shirt Size  Class

1 M Family Small C0

2 M Sports Medium C0

3 M Sports Medium C0

4 M Sports Large C0

5 M Sports ExtraLarge C0

6 M Sports ExtraLarge C0

7 F Sports Small C0

8 F Sports Small C0

9 F Sports Medium C0

10 F Luxury Large C0

11 M Family Large C1

12 M Family ExtraLarge C1

13 M Family Medium C1

14 M Luxury ExtraLarge C1

15 F Luxury Small C1

16 F Luxury Small C1

17 F Luxury Medium C1

18 F Luxury Medium C1

19 F Luxury Medium C1

20 F Luxury Large C1

a. Compute the Gini index for the overall collection of training examples.

b. Compute the Gini index for the Customer ID attribute.

c. Compute the Gini index for the Gender attribute.

d. Compute the Gini index for the Car Type attribute using multiway split.

e. Compute the Gini index for the Shirt Size attribute using multiway split.

f. Which attribute is better, Gender, Car Type, or Shirt Size?

g. Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.

3. Consider the training examples shown in Table 3.6 for a binary classification problem.

Table 3.6. Data set for Exercise 3.

Instance | a1 | a2 | a3  | Target Class
1        | T  | T  | 1.0 | +
2        | T  | T  | 6.0 | +
3        | T  | F  | 5.0 | −
4        | F  | F  | 4.0 | +
5        | F  | T  | 7.0 | −
6        | F  | T  | 3.0 | −
7        | F  | F  | 8.0 | −
8        | T  | F  | 7.0 | +
9        | F  | T  | 5.0 | −

a. What is the entropy of this collection of training examples with respect to the class attribute?

b. What are the information gains of a1 and a2 relative to these training examples?

c. For a3, which is a continuous attribute, compute the information gain for every possible split.

d. What is the best split (among a1, a2, and a3) according to the information gain?

e. What is the best split (between a1 and a2) according to the misclassification error rate?

f. What is the best split (between a1 and a2) according to the Gini index?

4. Show that the entropy of a node never increases after splitting it into smaller successor nodes.

5. Consider the following data set for a binary class problem.

A | B | Class Label
T | F | +
T | T | +
T | T | +
T | F | −
T | T | +
F | F | −
F | F | −
F | F | −
T | T | −
T | F | −

a. Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

b. Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

c. Figure 3.11 shows that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and they are both monotonically decreasing on the range [0.5, 1]. Is it possible that information gain and the gain in the Gini index favor different attributes? Explain.

6. Consider splitting a parent node P into two child nodes, C1 and C2, using some attribute test condition. The composition of labeled training instances at every node is summarized in the table below.

        | P | C1 | C2
Class 0 | 7 | 3  | 4
Class 1 | 3 | 0  | 3

a. Calculate the Gini index and misclassification error rate of the parent node P.

b. Calculate the weighted Gini index of the child nodes. Would you consider this attribute test condition if Gini is used as the impurity measure?

c. Calculate the weighted misclassification rate of the child nodes. Would you consider this attribute test condition if misclassification rate is used as the impurity measure?

7. Consider the following set of training examples.

X  Y  Z  No. of Class C1 Examples  No. of Class C2 Examples

0 0 0 5 40

0 0 1 0 15

0 1 0 10 5

0 1 1 45 0


1 0 0 10 5

1 0 1 25 0

1 1 0 5 20

1 1 1 0 15

a. Computeatwo-leveldecisiontreeusingthegreedyapproachdescribedinthischapter.Usetheclassificationerrorrateasthecriterionforsplitting.Whatistheoverallerrorrateoftheinducedtree?

b. Repeatpart(a)usingXasthefirstsplittingattributeandthenchoosethebestremainingattributeforsplittingateachofthetwosuccessornodes.Whatistheerrorrateoftheinducedtree?

c. Comparetheresultsofparts(a)and(b).Commentonthesuitabilityofthegreedyheuristicusedforsplittingattributeselection.

8. The following table summarizes a data set with three attributes A, B, C and two class labels, + and −. Build a two-level decision tree.

A  B  C  No. of + Instances  No. of − Instances

T T T 5 0

F T T 0 20

T F T 20 0

F F T 0 5

T T F 0 0


F T F 25 0

T F F 0 0

F F F 0 25

a. Accordingtotheclassificationerrorrate,whichattributewouldbechosenasthefirstsplittingattribute?Foreachattribute,showthecontingencytableandthegainsinclassificationerrorrate.

b. Repeatforthetwochildrenoftherootnode.

c. Howmanyinstancesaremisclassifiedbytheresultingdecisiontree?

d. Repeatparts(a),(b),and(c)usingCasthesplittingattribute.

e. Usetheresultsinparts(c)and(d)toconcludeaboutthegreedynatureofthedecisiontreeinductionalgorithm.

9.ConsiderthedecisiontreeshowninFigure3.36 .

Figure3.36.DecisiontreeanddatasetsforExercise9.

a. Computethegeneralizationerrorrateofthetreeusingtheoptimisticapproach.

b. Computethegeneralizationerrorrateofthetreeusingthepessimisticapproach.(Forsimplicity,usethestrategyofaddingafactorof0.5toeachleafnode.)

c. Computethegeneralizationerrorrateofthetreeusingthevalidationsetshownabove.Thisapproachisknownasreducederrorpruning.

10. Consider the decision trees shown in Figure 3.37. Assume they are generated from a data set that contains 16 binary attributes and 3 classes, C1, C2, and C3.

Compute the total description length of each decision tree according to the following formulation of the minimum description length principle.

The total description length of a tree is given by

Cost(tree, data) = Cost(tree) + Cost(data|tree).

Each internal node of the tree is encoded by the ID of the splitting attribute. If there are m attributes, the cost of encoding each attribute is log2 m bits.

Each leaf is encoded using the ID of the class it is associated with. If there are k classes, the cost of encoding a class is log2 k bits.

Cost(tree) is the cost of encoding all the nodes in the tree. To simplify the computation, you can assume that the total cost of the tree is obtained by adding up the costs of encoding each internal node and each leaf node.

Cost(data|tree) is encoded using the classification errors the tree commits on the training set. Each error is encoded by log2 n bits, where n is the total number of training instances.

Which decision tree is better, according to the MDL principle?

Figure 3.37. Decision trees for Exercise 10.

11.Thisexercise,inspiredbythediscussionsin[155],highlightsoneoftheknownlimitationsoftheleave-one-outmodelevaluationprocedure.Letusconsideradatasetcontaining50positiveand50negativeinstances,wheretheattributesarepurelyrandomandcontainnoinformationabouttheclasslabels.Hence,thegeneralizationerrorrateofanyclassificationmodellearnedoverthisdataisexpectedtobe0.5.Letusconsideraclassifierthatassignsthemajorityclasslabeloftraininginstances(tiesresolvedbyusingthepositivelabelasthedefaultclass)toanytestinstance,irrespectiveofitsattributevalues.Wecancallthisapproachasthemajorityinducerclassifier.Determinetheerrorrateofthisclassifierusingthefollowingmethods.

a. Leave-one-out.

b. 2-foldstratifiedcross-validation,wheretheproportionofclasslabelsateveryfoldiskeptsameasthatoftheoveralldata.

c. Fromtheresultsabove,whichmethodprovidesamorereliableevaluationoftheclassifier'sgeneralizationerrorrate?

12. Consider a labeled data set containing 100 data instances, which is randomly partitioned into two sets A and B, each containing 50 instances. We use A as the training set to learn two decision trees, T10 with 10 leaf nodes and T100 with 100 leaf nodes. The accuracies of the two decision trees on data sets A and B are shown in Table 3.7.

Table 3.7. Comparing the test accuracy of decision trees T10 and T100.

Data Set | Accuracy of T10 | Accuracy of T100
A        | 0.86            | 0.97
B        | 0.84            | 0.77

a. Based on the accuracies shown in Table 3.7, which classification model would you expect to have better performance on unseen instances?

b. Now, you tested T10 and T100 on the entire data set (A + B) and found that the classification accuracy of T10 on data set (A + B) is 0.85, whereas the classification accuracy of T100 on the data set (A + B) is 0.87. Based on this new information and your observations from Table 3.7, which classification model would you finally choose for classification?

13. Consider the following approach for testing whether a classifier A beats another classifier B. Let N be the size of a given data set, pA be the accuracy of classifier A, pB be the accuracy of classifier B, and p = (pA + pB)/2 be the average accuracy for both classifiers. To test whether classifier A is significantly better than B, the following Z-statistic is used:

Z = (pA − pB) / √( 2p(1 − p) / N ).

Classifier A is assumed to be better than classifier B if Z > 1.96.

Table 3.8 compares the accuracies of three different classifiers, decision tree classifiers, naïve Bayes classifiers, and support vector machines, on various data sets. (The latter two classifiers are described in Chapter 4.)

Summarize the performance of the classifiers given in Table 3.8 using the following 3 × 3 table:

win-loss-draw          | Decision tree | Naïve Bayes | Support vector machine
Decision tree          | 0-0-23        |             |
Naïve Bayes            |               | 0-0-23      |
Support vector machine |               |             | 0-0-23

Table 3.8. Comparing the accuracy of various classification methods.

Data Set  Size (N)  Decision Tree (%)  naïve Bayes (%)  Support Vector Machine (%)

Anneal 898 92.09 79.62 87.19

Australia 690 85.51 76.81 84.78

Auto 205 81.95 58.05 70.73

Breast 699 95.14 95.99 96.42

Cleve 303 76.24 83.50 84.49

Credit 690 85.80 77.54 85.07

Diabetes 768 72.40 75.91 76.82

German 1000 70.90 74.70 74.40

Glass 214 67.29 48.59 59.81

Heart 270 80.00 84.07 83.70

Hepatitis 155 81.94 83.23 87.10

Horse 368 85.33 78.80 82.61

Ionosphere 351 89.17 82.34 88.89

Iris 150 94.67 95.33 96.00

Labor 57 78.95 94.74 92.98

Led7 3200 73.34 73.16 73.56

Lymphography 148 77.03 83.11 86.49

Pima 768 74.35 76.04 76.95

Sonar 208 78.85 69.71 76.92

Tic-tac-toe 958 83.72 70.04 98.33

Vehicle 846 71.04 45.04 74.94

Wine 178 94.38 96.63 98.88

Zoo 101 93.07 93.07 96.04

Each cell in the table contains the number of wins, losses, and draws when comparing the classifier in a given row to the classifier in a given column.

14. Let X be a binomial random variable with mean Np and variance Np(1 − p). Show that the ratio X/N also has a binomial distribution with mean p and variance p(1 − p)/N.

4 Classification: Alternative Techniques

Thepreviouschapterintroducedtheclassificationproblemandpresentedatechniqueknownasthedecisiontreeclassifier.Issuessuchasmodeloverfittingandmodelevaluationwerealsodiscussed.Thischapterpresentsalternativetechniquesforbuildingclassificationmodels—fromsimpletechniquessuchasrule-basedandnearestneighborclassifierstomoresophisticatedtechniquessuchasartificialneuralnetworks,deeplearning,supportvectormachines,andensemblemethods.Otherpracticalissuessuchastheclassimbalanceandmulticlassproblemsarealsodiscussedattheendofthechapter.

4.1TypesofClassifiersBeforepresentingspecifictechniques,wefirstcategorizethedifferenttypesofclassifiersavailable.Onewaytodistinguishclassifiersisbyconsideringthecharacteristicsoftheiroutput.

BinaryversusMulticlass

Binary classifiers assign each data instance to one of two possible labels, typically denoted as +1 and −1. The positive class usually refers to the category we are more interested in predicting correctly compared to the negative class (e.g., the spam category in email classification problems). If there are more than two possible labels available, then the technique is known as a multiclass classifier. As some classifiers were designed for binary classes only, they must be adapted to deal with multiclass problems. Techniques for transforming binary classifiers into multiclass classifiers are described in Section 4.12.

DeterministicversusProbabilistic

A deterministic classifier produces a discrete-valued label for each data instance it classifies, whereas a probabilistic classifier assigns a continuous score between 0 and 1 to indicate how likely it is that an instance belongs to a particular class, where the probability scores for all the classes sum up to 1. Some examples of probabilistic classifiers include the naïve Bayes classifier, Bayesian networks, and logistic regression. Probabilistic classifiers provide additional information about the confidence in assigning an instance to a class compared to deterministic classifiers. A data instance is typically assigned to the class with the highest probability score, except when the cost of misclassifying the class with the lower probability is significantly higher. We will discuss the topic of cost-sensitive classification with probabilistic outputs in Section 4.11.2.

Anotherwaytodistinguishthedifferenttypesofclassifiersisbasedontheirtechniquefordiscriminatinginstancesfromdifferentclasses.

LinearversusNonlinear

A linear classifier uses a linear separating hyperplane to discriminate instances from different classes, whereas a nonlinear classifier enables the construction of more complex, nonlinear decision surfaces. We illustrate an example of a linear classifier (perceptron) and its nonlinear counterpart (multi-layer neural network) in Section 4.7. Although the linearity assumption makes the model less flexible in terms of fitting complex data, it also makes linear classifiers less susceptible to model overfitting than nonlinear classifiers. Furthermore, one can transform the original set of attributes, x = (x1, x2, …, xd), into a more complex feature set, e.g., Φ(x) = (x1, x2, x1x2, x1², x2², …), before applying the linear classifier. Such a feature transformation allows the linear classifier to fit data sets with nonlinear decision surfaces (see Section 4.9.4).
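As a rough illustration of such a transformation (a minimal sketch of our own, not an implementation from the text; the helper name phi is hypothetical), the two-attribute mapping above can be written as:

def phi(x1, x2):
    # Map a 2-attribute instance to the expanded feature set
    # (x1, x2, x1*x2, x1^2, x2^2); a linear classifier trained on
    # these features can represent a quadratic decision surface
    # in the original attribute space.
    return [x1, x2, x1 * x2, x1 ** 2, x2 ** 2]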

Global versus Local

A global classifier fits a single model to the entire data set. Unless the model is highly nonlinear, this one-size-fits-all strategy may not be effective when the relationship between the attributes and the class labels varies over the input space. In contrast, a local classifier partitions the input space into smaller regions and fits a distinct model to training instances in each region. The k-nearest neighbor classifier (see Section 4.3) is a classic example of local classifiers. While local classifiers are more flexible in terms of fitting complex decision boundaries, they are also more susceptible to the model overfitting problem, especially when the local regions contain few training examples.

GenerativeversusDiscriminative

Givenadatainstance ,theprimaryobjectiveofanyclassifieristopredicttheclasslabel,y,ofthedatainstance.However,apartfrompredictingtheclasslabel,wemayalsobeinterestedindescribingtheunderlyingmechanismthatgeneratestheinstancesbelongingtoeveryclasslabel.Forexample,intheprocessofclassifyingspamemailmessages,itmaybeusefultounderstandthetypicalcharacteristicsofemailmessagesthatarelabeledasspam,e.g.,specificusageofkeywordsinthesubjectorthebodyoftheemail.Classifiersthatlearnagenerativemodelofeveryclassintheprocessofpredictingclasslabelsareknownasgenerativeclassifiers.SomeexamplesofgenerativeclassifiersincludethenaïveBayesclassifierandBayesiannetworks.Incontrast,discriminativeclassifiersdirectlypredicttheclasslabelswithoutexplicitlydescribingthedistributionofeveryclasslabel.Theysolveasimplerproblemthangenerativemodelssincetheydonothavetheonusofderivinginsightsaboutthegenerativemechanismofdatainstances.Theyarethussometimespreferredovergenerativemodels,especiallywhenitisnotcrucialtoobtaininformationaboutthepropertiesofeveryclass.Someexamplesofdiscriminativeclassifiersincludedecisiontrees,rule-basedclassifier,nearestneighborclassifier,artificialneuralnetworks,andsupportvectormachines.

4.2 Rule-Based Classifier

A rule-based classifier uses a collection of "if …then…" rules (also known as a rule set) to classify data instances. Table 4.1 shows an example of a rule set generated for the vertebrate classification problem described in the previous chapter. Each classification rule in the rule set can be expressed in the following way:

ri: (Conditioni) → yi.   (4.1)

The left-hand side of the rule is called the rule antecedent or precondition. It contains a conjunction of attribute test conditions:

Conditioni = (A1 op v1) ∧ (A2 op v2) ∧ … ∧ (Ak op vk),   (4.2)

where (Aj, vj) is an attribute-value pair and op is a comparison operator chosen from the set {=, ≠, <, >, ≤, ≥}. Each attribute test (Aj op vj) is also known as a conjunct. The right-hand side of the rule is called the rule consequent, which contains the predicted class yi.

A rule r covers a data instance x if the precondition of r matches the attributes of x. r is also said to be fired or triggered whenever it covers a given instance. For an illustration, consider the rule r1 given in Table 4.1 and the following attributes for two vertebrates: hawk and grizzly bear.

Table 4.1. Example of a rule set for the vertebrate classification problem.

r1: (Gives Birth = no) ∧ (Aerial Creature = yes) → Birds
r2: (Gives Birth = no) ∧ (Aquatic Creature = yes) → Fishes
r3: (Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammals
r4: (Gives Birth = no) ∧ (Aerial Creature = no) → Reptiles
r5: (Aquatic Creature = semi) → Amphibians

Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates
hawk | warm-blooded | feather | no | no | yes | yes | no
grizzly bear | warm-blooded | fur | yes | no | no | yes | yes

r1 covers the first vertebrate because its precondition is satisfied by the hawk's attributes. The rule does not cover the second vertebrate because grizzly bears give birth to their young and cannot fly, thus violating the precondition of r1.

The quality of a classification rule can be evaluated using measures such as coverage and accuracy. Given a data set D and a classification rule r: A → y, the coverage of the rule is the fraction of instances in D that trigger the rule r. On the other hand, its accuracy or confidence factor is the fraction of instances triggered by r whose class labels are equal to y. The formal definitions of these measures are

Coverage(r) = |A| / |D|,    Accuracy(r) = |A ∩ y| / |A|,   (4.3)

where |A| is the number of instances that satisfy the rule antecedent, |A ∩ y| is the number of instances that satisfy both the antecedent and consequent, and |D| is the total number of instances.

Example 4.1. Consider the data set shown in Table 4.2. The rule

(Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammals

has a coverage of 33% since five of the fifteen instances support the rule antecedent. The rule accuracy is 100% because all five vertebrates covered by the rule are mammals.
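The two measures are obtained by simple counting. The following sketch (our own illustration; the dictionary encodings of rules and instances are hypothetical) evaluates Equation 4.3 for a rule whose antecedent is a set of attribute tests:

def coverage_and_accuracy(antecedent, rule_class, data):
    # antecedent: dict mapping attribute -> required value
    # data: list of (attribute_dict, class_label) pairs
    covered = [(x, y) for x, y in data
               if all(x.get(a) == v for a, v in antecedent.items())]
    correct = sum(1 for _, y in covered if y == rule_class)
    coverage = len(covered) / len(data)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy

# Applied to the 15 vertebrates of Table 4.2, the rule of Example 4.1
# would yield coverage 5/15 (33%) and accuracy 5/5 (100%).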

Table 4.2. The vertebrate data set.

Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
human | warm-blooded | hair | yes | no | no | yes | no | Mammals
python | cold-blooded | scales | no | no | no | no | yes | Reptiles
salmon | cold-blooded | scales | no | yes | no | no | no | Fishes
whale | warm-blooded | hair | yes | yes | no | no | no | Mammals
frog | cold-blooded | none | no | semi | no | yes | yes | Amphibians
komodo dragon | cold-blooded | scales | no | no | no | yes | no | Reptiles
bat | warm-blooded | hair | yes | no | yes | yes | yes | Mammals
pigeon | warm-blooded | feathers | no | no | yes | yes | no | Birds
cat | warm-blooded | fur | yes | no | no | yes | no | Mammals
guppy | cold-blooded | scales | yes | yes | no | no | no | Fishes
alligator | cold-blooded | scales | no | semi | no | yes | no | Reptiles
penguin | warm-blooded | feathers | no | semi | no | yes | no | Birds
porcupine | warm-blooded | quills | yes | no | no | yes | yes | Mammals
eel | cold-blooded | scales | no | yes | no | no | no | Fishes
salamander | cold-blooded | none | no | semi | no | yes | yes | Amphibians

4.2.1 How a Rule-Based Classifier Works

A rule-based classifier classifies a test instance based on the rule triggered by the instance. To illustrate how a rule-based classifier works, consider the rule set shown in Table 4.1 and the following vertebrates:

Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates
lemur | warm-blooded | fur | yes | no | no | yes | yes
turtle | cold-blooded | scales | no | semi | no | yes | no
dogfish shark | cold-blooded | scales | yes | yes | no | no | no

The first vertebrate, which is a lemur, is warm-blooded and gives birth to its young. It triggers the rule r3, and thus, is classified as a mammal. The second vertebrate, which is a turtle, triggers the rules r4 and r5. Since the classes predicted by the rules are contradictory (reptiles versus amphibians), their conflicting classes must be resolved. None of the rules are applicable to a dogfish shark. In this case, we need to determine what class to assign to such a test instance.

4.2.2 Properties of a Rule Set

The rule set generated by a rule-based classifier can be characterized by the following two properties.

Definition 4.1 (Mutually Exclusive Rule Set). The rules in a rule set R are mutually exclusive if no two rules in R are triggered by the same instance. This property ensures that every instance is covered by at most one rule in R.

Definition 4.2 (Exhaustive Rule Set). A rule set R has exhaustive coverage if there is a rule for each combination of attribute values. This property ensures that every instance is covered by at least one rule in R.

Table 4.3. Example of a mutually exclusive and exhaustive rule set.

r1: (Body Temperature = cold-blooded) → Non-mammals
r2: (Body Temperature = warm-blooded) ∧ (Gives Birth = yes) → Mammals
r3: (Body Temperature = warm-blooded) ∧ (Gives Birth = no) → Non-mammals

Together, these two properties ensure that every instance is covered by exactly one rule. An example of a mutually exclusive and exhaustive rule set is shown in Table 4.3. Unfortunately, many rule-based classifiers, including the one shown in Table 4.1, do not have such properties. If the rule set is not exhaustive, then a default rule, rd: () → yd, must be added to cover the remaining cases. A default rule has an empty antecedent and is triggered when all other rules have failed. yd is known as the default class and is typically assigned to the majority class of training instances not covered by the existing rules. If the rule set is not mutually exclusive, then an instance can be covered by more than one rule, some of which may predict conflicting classes.

Definition 4.3 (Ordered Rule Set). The rules in an ordered rule set R are ranked in decreasing order of their priority. An ordered rule set is also known as a decision list.

The rank of a rule can be defined in many ways, e.g., based on its accuracy or total description length. When a test instance is presented, it will be classified by the highest-ranked rule that covers the instance. This avoids the problem of having conflicting classes predicted by multiple classification rules if the rule set is not mutually exclusive.

An alternative way to handle a non-mutually exclusive rule set without ordering the rules is to consider the consequent of each rule triggered by a test instance as a vote for a particular class. The votes are then tallied to determine the class label of the test instance. The instance is usually assigned to the class that receives the highest number of votes. The vote may also be weighted by the rule's accuracy. Using unordered rules to build a rule-based classifier has both advantages and disadvantages. Unordered rules are less susceptible to errors caused by the wrong rule being selected to classify a test instance, unlike classifiers based on ordered rules, which are sensitive to the choice of rule-ordering criteria. Model building is also less expensive because the rules do not need to be kept in sorted order. Nevertheless, classifying a test instance can be quite expensive because the attributes of the test instance must be compared against the precondition of every rule in the rule set.
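To make the contrast concrete, the following sketch (our own minimal illustration, with hypothetical rule and instance encodings) classifies an instance with an ordered rule list, where the first triggered rule decides, and with unordered rules, where triggered rules vote, weighted by their accuracy:

from collections import defaultdict

def matches(antecedent, x):
    # antecedent: dict of attribute -> required value; x: dict of attribute values
    return all(x.get(a) == v for a, v in antecedent.items())

def classify_ordered(rules, x, default_class):
    # rules: list of (antecedent, label, accuracy), highest priority first
    for antecedent, label, _ in rules:
        if matches(antecedent, x):
            return label            # first triggered rule decides
    return default_class            # default rule rd fires when nothing else does

def classify_unordered(rules, x, default_class):
    votes = defaultdict(float)
    for antecedent, label, accuracy in rules:
        if matches(antecedent, x):
            votes[label] += accuracy    # vote weighted by the rule's accuracy
    return max(votes, key=votes.get) if votes else default_class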

Inthenexttwosections,wepresenttechniquesforextractinganorderedrulesetfromdata.Arule-basedclassifiercanbeconstructedusing(1)directmethods,whichextractclassificationrulesdirectlyfromdata,and(2)indirectmethods,whichextractclassificationrulesfrommorecomplexclassificationmodels,suchasdecisiontreesandneuralnetworks.DetaileddiscussionsofthesemethodsarepresentedinSections4.2.3 and4.2.4 ,respectively.

4.2.3 Direct Methods for Rule Extraction

To illustrate the direct method, we consider a widely-used rule induction algorithm called RIPPER. This algorithm scales almost linearly with the number of training instances and is particularly suited for building models from data sets with imbalanced class distributions. RIPPER also works well with noisy data because it uses a validation set to prevent model overfitting.

RIPPER uses the sequential covering algorithm to extract rules directly from data. Rules are grown in a greedy fashion one class at a time. For binary class problems, RIPPER chooses the majority class as its default class and learns the rules to detect instances from the minority class. For multiclass problems, the classes are ordered according to their prevalence in the training set. Let (y1, y2, …, yc) be the ordered list of classes, where y1 is the least prevalent class and yc is the most prevalent class. All training instances that belong to y1 are initially labeled as positive examples, while those that belong to other classes are labeled as negative examples. The sequential covering algorithm learns a set of rules to discriminate the positive from negative examples. Next, all training instances from y2 are labeled as positive, while those from classes y3, y4, …, yc are labeled as negative. The sequential covering algorithm would learn the next set of rules to distinguish y2 from the other remaining classes. This process is repeated until we are left with only one class, yc, which is designated as the default class.

Algorithm 4.1 Sequential covering algorithm.

AsummaryofthesequentialcoveringalgorithmisshowninAlgorithm4.1 .Thealgorithmstartswithanemptydecisionlist,R,andextractsrulesforeachclassbasedontheorderingspecifiedbytheclassprevalence.ItiterativelyextractstherulesforagivenclassyusingtheLearn-One-Rulefunction.Oncesucharuleisfound,allthetraininginstancescoveredbytheruleareeliminated.ThenewruleisaddedtothebottomofthedecisionlistR.Thisprocedureisrepeateduntilthestoppingcriterionismet.Thealgorithmthenproceedstogeneraterulesforthenextclass.

Figure4.1 demonstrateshowthesequentialcoveringalgorithmworksforadatasetthatcontainsacollectionofpositiveandnegativeexamples.TheruleR1,whosecoverageisshowninFigure4.1(b) ,isextractedfirstbecauseitcoversthelargestfractionofpositiveexamples.AllthetraininginstancescoveredbyR1aresubsequentlyremovedandthealgorithmproceedstolookforthenextbestrule,whichisR2.
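The overall control flow of Algorithm 4.1 can be sketched as follows (a schematic outline of our own; learn_one_rule, stopping_criterion, default_rule, and the covers method stand in for the components described in the text and are not library functions):

def sequential_covering(instances, classes_by_prevalence):
    # classes_by_prevalence: least prevalent class first; the most
    # prevalent class is kept as the default class.
    decision_list = []
    for y in classes_by_prevalence[:-1]:
        while not stopping_criterion(instances, y):
            rule = learn_one_rule(instances, target_class=y)
            # eliminate the training instances covered by the new rule
            instances = [(x, c) for x, c in instances if not rule.covers(x)]
            decision_list.append(rule)      # add to the bottom of the list R
    decision_list.append(default_rule(classes_by_prevalence[-1]))
    return decision_list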

Learn-One-Rule Function

Finding an optimal rule is computationally expensive due to the exponential search space to explore. The Learn-One-Rule function addresses this problem by growing the rules in a greedy fashion. It generates an initial rule r: {} → +, where the left-hand side is an empty set and the right-hand side corresponds to the positive class. It then refines the rule until a certain stopping criterion is met. The accuracy of the initial rule may be poor because some of the training instances covered by the rule belong to the negative class. A new conjunct must be added to the rule antecedent to improve its accuracy.

Figure 4.1. An example of the sequential covering algorithm.

RIPPER uses the FOIL's information gain measure to choose the best conjunct to be added into the rule antecedent. The measure takes into consideration both the gain in accuracy and the support of a candidate rule, where support is defined as the number of positive examples covered by the rule. For example, suppose the rule r: A → + initially covers p0 positive examples and n0 negative examples. After adding a new conjunct B, the extended rule r′: A ∧ B → + covers p1 positive examples and n1 negative examples. The FOIL's information gain of the extended rule is computed as follows:

FOIL's information gain = p1 × ( log2 (p1 / (p1 + n1)) − log2 (p0 / (p0 + n0)) ).   (4.4)

RIPPER chooses the conjunct with the highest FOIL's information gain to extend the rule, as illustrated in the next example.

Example 4.2. [FOIL's Information Gain] Consider the training set for the vertebrate classification problem shown in Table 4.2. Suppose the target class for the Learn-One-Rule function is mammals. Initially, the antecedent of the rule {} → Mammals covers 5 positive and 10 negative examples. Thus, the accuracy of the rule is only 0.333. Next, consider the following three candidate conjuncts to be added to the left-hand side of the rule: Skin Cover = hair, Body Temperature = warm-blooded, and Has Legs = No. The number of positive (p1) and negative (n1) examples covered by the rule after adding each conjunct, along with their respective accuracy and FOIL's information gain, are shown in the following table.

Candidate rule | p1 | n1 | Accuracy | Info Gain
{Skin Cover = hair} → Mammals | 3 | 0 | 1.000 | 4.755
{Body Temperature = warm-blooded} → Mammals | 5 | 2 | 0.714 | 5.498
{Has Legs = No} → Mammals | 1 | 4 | 0.200 | −0.737

Although {Skin Cover = hair} → Mammals has the highest accuracy among the three candidates, the conjunct Body Temperature = warm-blooded has the highest FOIL's information gain. Thus, it is chosen to extend the rule (see Figure 4.2).

This process continues until adding new conjuncts no longer improves the information gain measure.
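Equation 4.4 and the numbers in the table above are easy to reproduce directly (a small sketch of our own):

from math import log2

def foil_gain(p0, n0, p1, n1):
    # p0, n0: positives/negatives covered before adding the conjunct
    # p1, n1: positives/negatives covered after adding it
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Candidates from Example 4.2, starting from {} -> Mammals with p0 = 5, n0 = 10:
print(round(foil_gain(5, 10, 3, 0), 3))   # Skin Cover = hair            ->  4.755
print(round(foil_gain(5, 10, 5, 2), 3))   # Body Temperature = warm      ->  5.498
print(round(foil_gain(5, 10, 1, 4), 3))   # Has Legs = No                -> -0.737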

Rule Pruning

The rules generated by the Learn-One-Rule function can be pruned to improve their generalization errors. RIPPER prunes the rules based on their performance on the validation set. The following metric is computed to determine whether pruning is needed: (p − n)/(p + n), where p (n) is the number of positive (negative) examples in the validation set covered by the rule. This metric is monotonically related to the rule's accuracy on the validation set. If the metric improves after pruning, then the conjunct is removed. Pruning is done starting from the last conjunct added to the rule. For example, given a rule A ∧ B ∧ C ∧ D → y, RIPPER checks whether D should be pruned first, followed by CD, BCD, etc. While the original rule covers only positive examples, the pruned rule may cover some of the negative examples in the training set.

Building the Rule Set

After generating a rule, all the positive and negative examples covered by the rule are eliminated. The rule is then added into the rule set as long as it does not violate the stopping condition, which is based on the minimum description length principle. If the new rule increases the total description length of the rule set by at least d bits, then RIPPER stops adding rules into its rule set (by default, d is chosen to be 64 bits). Another stopping condition used by RIPPER is that the error rate of the rule on the validation set must not exceed 50%.

Figure4.2.General-to-specificandspecific-to-generalrule-growingstrategies.

RIPPERalsoperformsadditionaloptimizationstepstodeterminewhethersomeoftheexistingrulesintherulesetcanbereplacedbybetteralternativerules.Readerswhoareinterestedinthedetailsoftheoptimizationmethodmayrefertothereferencecitedattheendofthischapter.

InstanceElimination

Afteraruleisextracted,RIPPEReliminatesthepositiveandnegativeexamplescoveredbytherule.Therationalefordoingthisisillustratedinthenextexample.

Figure4.3 showsthreepossiblerules,R1,R2,andR3,extractedfromatrainingsetthatcontains29positiveexamplesand21negativeexamples.TheaccuraciesofR1,R2,andR3are12/15(80%),7/10(70%),and8/12(66.7%),respectively.R1isgeneratedfirstbecauseithasthehighestaccuracy.AftergeneratingR1,thealgorithmmustremovetheexamplescoveredbytherulesothatthenextrulegeneratedbythealgorithmisdifferentthanR1.Thequestionis,shoulditremovethepositiveexamplesonly,negativeexamplesonly,orboth?Toanswerthis,supposethealgorithmmustchoosebetweengeneratingR2orR3afterR1.EventhoughR2hasahigheraccuracythanR3(70%versus66.7%),observethattheregioncoveredbyR2isdisjointfromR1,whiletheregioncoveredbyR3overlapswithR1.Asaresult,R1andR3togethercover18positiveand5negativeexamples(resultinginanoverallaccuracyof78.3%),whereasR1andR2togethercover19positiveand6negativeexamples(resultinginaloweroverallaccuracyof76%).IfthepositiveexamplescoveredbyR1arenotremoved,thenwemayoverestimatetheeffectiveaccuracyofR3.IfthenegativeexamplescoveredbyR1arenotremoved,thenwemayunderestimatetheaccuracyofR3.Inthelattercase,wemightenduppreferringR2overR3eventhoughhalfofthefalsepositiveerrorscommittedbyR3havealreadybeenaccountedforbytheprecedingrule,R1.ThisexampleshowsthattheeffectiveaccuracyafteraddingR2orR3totherulesetbecomesevidentonlywhenbothpositiveandnegativeexamplescoveredbyR1areremoved.

Figure4.3.Eliminationoftraininginstancesbythesequentialcoveringalgorithm.R1,R2,andR3representregionscoveredbythreedifferentrules.

4.2.4IndirectMethodsforRuleExtraction

Thissectionpresentsamethodforgeneratingarulesetfromadecisiontree.Inprinciple,everypathfromtherootnodetotheleafnodeofadecisiontreecanbeexpressedasaclassificationrule.Thetestconditionsencounteredalongthepathformtheconjunctsoftheruleantecedent,whiletheclasslabelattheleafnodeisassignedtotheruleconsequent.Figure4.4 showsanexampleofarulesetgeneratedfromadecisiontree.Noticethattherulesetisexhaustiveandcontainsmutuallyexclusiverules.However,someoftherulescanbesimplifiedasshowninthenextexample.

Figure4.4.Convertingadecisiontreeintoclassificationrules.

Example 4.3. Consider the following three rules from Figure 4.4:

r2: (P = No) ∧ (Q = Yes) → +
r3: (P = Yes) ∧ (R = No) → +
r5: (P = Yes) ∧ (R = Yes) ∧ (Q = Yes) → +.

Observe that the rule set always predicts a positive class when the value of Q is Yes. Therefore, we may simplify the rules as follows:

r2′: (Q = Yes) → +
r3: (P = Yes) ∧ (R = No) → +.

r3 is retained to cover the remaining instances of the positive class. Although the rules obtained after simplification are no longer mutually exclusive, they are less complex and are easier to interpret.

In the following, we describe an approach used by the C4.5rules algorithm to generate a rule set from a decision tree. Figure 4.5 shows the decision tree and resulting classification rules obtained for the data set given in Table 4.2.

Rule Generation

Classification rules are extracted for every path from the root to one of the leaf nodes in the decision tree. Given a classification rule r: A → y, we consider a simplified rule, r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A. The simplified rule with the lowest pessimistic error rate is retained provided its error rate is less than that of the original rule. The rule-pruning step is repeated until the pessimistic error of the rule cannot be improved further. Because some of the rules may become identical after pruning, the duplicate rules are discarded.

Figure 4.5. Classification rules extracted from a decision tree for the vertebrate classification problem.

Rule Ordering

After generating the rule set, C4.5rules uses the class-based ordering scheme to order the extracted rules. Rules that predict the same class are grouped together into the same subset. The total description length for each subset is computed, and the classes are arranged in increasing order of their total description length. The class that has the smallest description length is given the highest priority because it is expected to contain the best set of rules. The total description length for a class is given by Lexception + g × Lmodel, where Lexception is the number of bits needed to encode the misclassified examples, Lmodel is the number of bits needed to encode the model, and g is a tuning parameter whose default value is 0.5. The tuning parameter depends on the number of redundant attributes present in the model. The value of the tuning parameter is small if the model contains many redundant attributes.

4.2.5CharacteristicsofRule-BasedClassifiers

1. Rule-based classifiers have very similar characteristics to decision trees. The expressiveness of a rule set is almost equivalent to that of a decision tree because a decision tree can be represented by a set of mutually exclusive and exhaustive rules. Both rule-based and decision tree classifiers create rectilinear partitions of the attribute space and assign a class to each partition. However, a rule-based classifier can allow multiple rules to be triggered for a given instance, thus enabling the learning of more complex models than decision trees.

2. Likedecisiontrees,rule-basedclassifierscanhandlevaryingtypesofcategoricalandcontinuousattributesandcaneasilyworkinmulticlassclassificationscenarios.Rule-basedclassifiersaregenerallyusedtoproducedescriptivemodelsthatareeasiertointerpretbutgivecomparableperformancetothedecisiontreeclassifier.

3. Rule-basedclassifierscaneasilyhandlethepresenceofredundantattributesthatarehighlycorrelatedwithoneother.Thisisbecauseonceanattributehasbeenusedasaconjunctinaruleantecedent,theremainingredundantattributeswouldshowlittletonoFOIL'sinformationgainandwouldthusbeignored.

4. Sinceirrelevantattributesshowpoorinformationgain,rule-basedclassifierscanavoidselectingirrelevantattributesifthereareotherrelevantattributesthatshowbetterinformationgain.However,iftheproblemiscomplexandthereareinteractingattributesthatcancollectivelydistinguishbetweentheclassesbutindividuallyshowpoorinformationgain,itislikelyforanirrelevantattributetobeaccidentallyfavoredoverarelevantattributejustbyrandomchance.Hence,rule-basedclassifierscanshowpoorperformanceinthepresenceofinteractingattributes,whenthenumberofirrelevantattributesislarge.

5. Theclass-basedorderingstrategyadoptedbyRIPPER,whichemphasizesgivinghigherprioritytorareclasses,iswellsuitedforhandlingtrainingdatasetswithimbalancedclassdistributions.

6. Rule-based classifiers are not well-suited for handling missing values in the test set. This is because the position of rules in a rule set follows a certain ordering strategy, and even if a test instance is covered by multiple rules, they can assign different class labels depending on their position in the rule set. Hence, if a certain rule involves an attribute that is missing in a test instance, it is difficult to ignore the rule and proceed to the subsequent rules in the rule set, as such a strategy can result in incorrect class assignments.

4.3NearestNeighborClassifiersTheclassificationframeworkshowninFigure3.3 involvesatwo-stepprocess:

(1)aninductivestepforconstructingaclassificationmodelfromdata,and

(2)adeductivestepforapplyingthemodeltotestexamples.Decisiontreeandrule-basedclassifiersareexamplesofeagerlearnersbecausetheyaredesignedtolearnamodelthatmapstheinputattributestotheclasslabelassoonasthetrainingdatabecomesavailable.Anoppositestrategywouldbetodelaytheprocessofmodelingthetrainingdatauntilitisneededtoclassifythetestinstances.Techniquesthatemploythisstrategyareknownaslazylearners.AnexampleofalazylearneristheRoteclassifier,whichmemorizestheentiretrainingdataandperformsclassificationonlyiftheattributesofatestinstancematchoneofthetrainingexamplesexactly.Anobviousdrawbackofthisapproachisthatsometestinstancesmaynotbeclassifiedbecausetheydonotmatchanytrainingexample.

Onewaytomakethisapproachmoreflexibleistofindallthetrainingexamplesthatarerelativelysimilartotheattributesofthetestinstances.Theseexamples,whichareknownasnearestneighbors,canbeusedtodeterminetheclasslabelofthetestinstance.Thejustificationforusingnearestneighborsisbestexemplifiedbythefollowingsaying:“Ifitwalkslikeaduck,quackslikeaduck,andlookslikeaduck,thenit'sprobablyaduck.”Anearestneighborclassifierrepresentseachexampleasadatapointinad-dimensionalspace,wheredisthenumberofattributes.Givenatestinstance,wecomputeitsproximitytothetraininginstancesaccordingtooneoftheproximitymeasuresdescribedinSection2.4 onpage71.Thek-nearest

neighborsofagiventestinstancezrefertothektrainingexamplesthatareclosesttoz.

Figure4.6 illustratesthe1-,2-,and3-nearestneighborsofatestinstancelocatedatthecenterofeachcircle.Theinstanceisclassifiedbasedontheclasslabelsofitsneighbors.Inthecasewheretheneighborshavemorethanonelabel,thetestinstanceisassignedtothemajorityclassofitsnearestneighbors.InFigure4.6(a) ,the1-nearestneighboroftheinstanceisanegativeexample.Thereforetheinstanceisassignedtothenegativeclass.Ifthenumberofnearestneighborsisthree,asshowninFigure4.6(c) ,thentheneighborhoodcontainstwopositiveexamplesandonenegativeexample.Usingthemajorityvotingscheme,theinstanceisassignedtothepositiveclass.Inthecasewherethereisatiebetweentheclasses(seeFigure4.6(b) ),wemayrandomlychooseoneofthemtoclassifythedatapoint.

Figure4.6.The1-,2-,and3-nearestneighborsofaninstance.

Theprecedingdiscussionunderscorestheimportanceofchoosingtherightvaluefork.Ifkistoosmall,thenthenearestneighborclassifiermaybesusceptibletooverfittingduetonoise,i.e.,mislabeledexamplesinthetraining

data.Ontheotherhand,ifkistoolarge,thenearestneighborclassifiermaymisclassifythetestinstancebecauseitslistofnearestneighborsincludestrainingexamplesthatarelocatedfarawayfromitsneighborhood(seeFigure4.7 ).

Figure4.7.k-nearestneighborclassificationwithlargek.

4.3.1Algorithm

A high-level summary of the nearest neighbor classification method is given in Algorithm 4.2. The algorithm computes the distance (or similarity) between each test instance z = (x′, y′) and all the training examples (x, y) ∈ D to determine its nearest neighbor list, Dz. Such computation can be costly if the number of training examples is large. However, efficient indexing techniques are available to reduce the computation needed to find the nearest neighbors of a test instance.

Algorithm 4.2 The k-nearest neighbor classifier.

Once the nearest neighbor list is obtained, the test instance is classified based on the majority class of its nearest neighbors:

Majority Voting: y′ = argmax_v ∑_{(xi, yi) ∈ Dz} I(v = yi),   (4.5)

where v is a class label, yi is the class label for one of the nearest neighbors, and I(·) is an indicator function that returns the value 1 if its argument is true and 0 otherwise.

In the majority voting approach, every neighbor has the same impact on the classification. This makes the algorithm sensitive to the choice of k, as shown in Figure 4.6. One way to reduce the impact of k is to weight the influence of each nearest neighbor xi according to its distance: wi = 1 / d(x′, xi)². As a result, training examples that are located far away from z have a weaker impact on the classification compared to those that are located close to z. Using the distance-weighted voting scheme, the class label can be determined as follows:

Distance-Weighted Voting: y′ = argmax_v ∑_{(xi, yi) ∈ Dz} wi × I(v = yi).   (4.6)
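The following sketch (a minimal implementation of our own, assuming numeric attributes and Euclidean distance) combines Equations 4.5 and 4.6 in a single function:

from collections import defaultdict
from math import dist   # Euclidean distance (Python 3.8+)

def knn_predict(train, z, k, weighted=False):
    # train: list of (attribute_tuple, label); z: attribute tuple of the test instance
    neighbors = sorted(train, key=lambda xy: dist(xy[0], z))[:k]   # the set Dz
    votes = defaultdict(float)
    for x, y in neighbors:
        d = dist(x, z)
        if weighted:                      # Eq. 4.6: vote weighted by 1/d^2
            if d == 0:
                return y                  # an exact match dominates the vote
            votes[y] += 1.0 / (d * d)
        else:                             # Eq. 4.5: equal weights
            votes[y] += 1.0
    return max(votes, key=votes.get)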

4.3.2CharacteristicsofNearestNeighborClassifiers

1. Nearestneighborclassificationispartofamoregeneraltechniqueknownasinstance-basedlearning,whichdoesnotbuildaglobalmodel,butratherusesthetrainingexamplestomakepredictionsforatestinstance.(Thus,suchclassifiersareoftensaidtobe“modelfree.”)Suchalgorithmsrequireaproximitymeasuretodeterminethesimilarityordistancebetweeninstancesandaclassificationfunctionthatreturnsthepredictedclassofatestinstancebasedonitsproximitytootherinstances.

2. Althoughlazylearners,suchasnearestneighborclassifiers,donotrequiremodelbuilding,classifyingatestinstancecanbequiteexpensivebecauseweneedtocomputetheproximityvaluesindividuallybetweenthetestandtrainingexamples.Incontrast,eagerlearnersoftenspendthebulkoftheircomputingresourcesformodelbuilding.Onceamodelhasbeenbuilt,classifyingatestinstanceisextremelyfast.

3. Nearestneighborclassifiersmaketheirpredictionsbasedonlocalinformation.(Thisisequivalenttobuildingalocalmodelforeachtestinstance.)Bycontrast,decisiontreeandrule-basedclassifiersattempttofindaglobalmodelthatfitstheentireinputspace.Becausetheclassificationdecisionsaremadelocally,nearestneighborclassifiers(withsmallvaluesofk)arequitesusceptibletonoise.

4. Nearestneighborclassifierscanproducedecisionboundariesofarbitraryshape.Suchboundariesprovideamoreflexiblemodelrepresentationcomparedtodecisiontreeandrule-basedclassifiersthatareoftenconstrainedtorectilineardecisionboundaries.Thedecisionboundariesofnearestneighborclassifiersalsohavehigh

variabilitybecausetheydependonthecompositionoftrainingexamplesinthelocalneighborhood.Increasingthenumberofnearestneighborsmayreducesuchvariability.

5. Nearest neighbor classifiers have difficulty handling missing values in both the training and test sets, since proximity computations normally require the presence of all attributes. Although the subset of attributes present in two instances can be used to compute a proximity, such an approach may not produce good results since the proximity measures may be different for each pair of instances and thus hard to compare.

6. Nearest neighbor classifiers can handle the presence of interacting attributes, i.e., attributes that have more predictive power taken in combination than by themselves, by using appropriate proximity measures that can incorporate the effects of multiple attributes together.

7. Thepresenceofirrelevantattributescandistortcommonlyusedproximitymeasures,especiallywhenthenumberofirrelevantattributesislarge.Furthermore,iftherearealargenumberofredundantattributesthatarehighlycorrelatedwitheachother,thentheproximitymeasurecanbeoverlybiasedtowardsuchattributes,resultinginimproperestimatesofdistance.Hence,thepresenceofirrelevantandredundantattributescanadverselyaffecttheperformanceofnearestneighborclassifiers.

8. Nearestneighborclassifierscanproducewrongpredictionsunlesstheappropriateproximitymeasureanddatapreprocessingstepsaretaken.Forexample,supposewewanttoclassifyagroupofpeoplebasedonattributessuchasheight(measuredinmeters)andweight(measuredinpounds).Theheightattributehasalowvariability,rangingfrom1.5mto1.85m,whereastheweightattributemayvaryfrom90lb.to250lb.Ifthescaleoftheattributesarenottakenintoconsideration,theproximitymeasuremaybedominatedbydifferencesintheweightsofaperson.

4.4NaïveBayesClassifierManyclassificationproblemsinvolveuncertainty.First,theobservedattributesandclasslabelsmaybeunreliableduetoimperfectionsinthemeasurementprocess,e.g.,duetothelimitedprecisenessofsensordevices.Second,thesetofattributeschosenforclassificationmaynotbefullyrepresentativeofthetargetclass,resultinginuncertainpredictions.Toillustratethis,considertheproblemofpredictingaperson'sriskforheartdiseasebasedonamodelthatusestheirdietandworkoutfrequencyasattributes.Althoughmostpeoplewhoeathealthilyandexerciseregularlyhavelesschanceofdevelopingheartdisease,theymaystillbeatriskduetootherlatentfactors,suchasheredity,excessivesmoking,andalcoholabuse,thatarenotcapturedinthemodel.Third,aclassificationmodellearnedoverafinitetrainingsetmaynotbeabletofullycapturethetruerelationshipsintheoveralldata,asdiscussedinthecontextofmodeloverfittinginthepreviouschapter.Finally,uncertaintyinpredictionsmayariseduetotheinherentrandomnatureofreal-worldsystems,suchasthoseencounteredinweatherforecastingproblems.

Inthepresenceofuncertainty,thereisaneedtonotonlymakepredictionsofclasslabelsbutalsoprovideameasureofconfidenceassociatedwitheveryprediction.Probabilitytheoryoffersasystematicwayforquantifyingandmanipulatinguncertaintyindata,andthus,isanappealingframeworkforassessingtheconfidenceofpredictions.Classificationmodelsthatmakeuseofprobabilitytheorytorepresenttherelationshipbetweenattributesandclasslabelsareknownasprobabilisticclassificationmodels.Inthissection,wepresentthenaïveBayesclassifier,whichisoneofthesimplestandmostwidely-usedprobabilisticclassificationmodels.

4.4.1BasicsofProbabilityTheory

BeforewediscusshowthenaïveBayesclassifierworks,wefirstintroducesomebasicsofprobabilitytheorythatwillbeusefulinunderstandingtheprobabilisticclassificationmodelspresentedinthischapter.Thisinvolvesdefiningthenotionofprobabilityandintroducingsomecommonapproachesformanipulatingprobabilityvalues.

Consider a variable X, which can take any discrete value from the set {x1, …, xk}. When we have multiple observations of that variable, such as in a data set where the variable describes some characteristic of data objects, then we can compute the relative frequency with which each value occurs. Specifically, suppose that X has the value xi for ni data objects. The relative frequency with which we observe the event X = xi is then ni/N, where N denotes the total number of occurrences (N = ∑_{i=1}^{k} ni). These relative frequencies characterize the uncertainty that we have with respect to what value X may take for an unseen observation, and motivate the notion of probability.

More formally, the probability of an event e, e.g., P(X = xi), measures how likely it is for the event e to occur. The most traditional view of probability is based on the relative frequency of events (frequentist), while the Bayesian viewpoint (described later) takes a more flexible view of probabilities. In either case, a probability is always a number between 0 and 1. Further, the sum of the probability values of all possible events, e.g., outcomes of a variable X, is equal to 1. Variables that have probabilities associated with each possible outcome (value) are known as random variables.

Now, let us consider two random variables, X and Y, that can each take k discrete values. Let nij be the number of times we observe X = xi and Y = yj, out of a total number of N occurrences. The joint probability of observing X = xi and Y = yj together can be estimated as

P(X = xi, Y = yj) = nij / N.   (4.7)

(This is an estimate since we typically have only a finite subset of all possible observations.) Joint probabilities can be used to answer questions such as "what is the probability that there will be a surprise quiz today and I will be late for the class." Joint probabilities are symmetric, i.e., P(X = x, Y = y) = P(Y = y, X = x). For joint probabilities, it is useful to consider their sum with respect to one of the random variables, as described in the following equation:

∑_{j=1}^{k} P(X = xi, Y = yj) = ∑_{j=1}^{k} nij / N = ni / N = P(X = xi),   (4.8)

where ni is the total number of times we observe X = xi irrespective of the value of Y. Notice that ni/N is essentially the probability of observing X = xi. Hence, by summing out the joint probabilities with respect to a random variable Y, we obtain the probability of observing the remaining variable X. This operation is called marginalization, and the probability value P(X = xi) obtained by marginalizing out Y is sometimes called the marginal probability of X. As we will see later, joint probability and marginal probability form the basic building blocks of a number of probabilistic classification models discussed in this chapter.

Notice that in the previous discussions, we used P(X = xi) to denote the probability of a particular outcome of a random variable X. This notation can easily become cumbersome when a number of random variables are involved. Hence, in the remainder of this section, we will use P(X) to denote the probability of any generic outcome of the random variable X, while P(xi) will be used to represent the probability of the specific outcome xi.

Bayes Theorem

Suppose you have invited two of your friends Alex and Martha to a dinner party. You know that Alex attends 40% of the parties he is invited to. Further, if Alex is going to a party, there is an 80% chance of Martha coming along. On the other hand, if Alex is not going to the party, the chance of Martha coming to the party is reduced to 30%. If Martha has responded that she will be coming to your party, what is the probability that Alex will also be coming?

Bayestheorempresentsthestatisticalprincipleforansweringquestionslikethepreviousone,whereevidencefrommultiplesourceshastobecombinedwithpriorbeliefstoarriveatpredictions.Bayestheoremcanbebrieflydescribedasfollows.

Let P(Y|X) denote the conditional probability of observing the random variable Y whenever the random variable X takes a particular value. P(Y|X) is often read as the probability of observing Y conditioned on the outcome of X. Conditional probabilities can be used for answering questions such as "given that it is going to rain today, what will be the probability that I will go to the class." The conditional probabilities of X and Y are related to their joint probability in the following way:

P(Y|X) = P(X, Y) / P(X),  which implies   (4.9)

P(X, Y) = P(Y|X) × P(X) = P(X|Y) × P(Y).   (4.10)

Rearranging the last two expressions in Equation 4.10 leads to Equation 4.11, which is known as Bayes theorem:

P(Y|X) = P(X|Y) P(Y) / P(X).   (4.11)

Bayes theorem provides a relationship between the conditional probabilities P(Y|X) and P(X|Y). Note that the denominator in Equation 4.11 involves the marginal probability of X, which can also be represented as

P(X) = ∑_{i=1}^{k} P(X, yi) = ∑_{i=1}^{k} P(X|yi) × P(yi).

Using the previous expression for P(X), we can obtain the following equation for P(Y|X) solely in terms of P(X|Y) and P(Y):

P(Y|X) = P(X|Y) P(Y) / ∑_{i=1}^{k} P(X|yi) P(yi).   (4.12)

Example 4.4. [Bayes Theorem] Bayes theorem can be used to solve a number of inferential questions about random variables. For example, consider the problem stated at the beginning on inferring whether Alex will come to the party. Let P(A = 1) denote the probability of Alex going to a party, while P(A = 0) denotes the probability of him not going to a party. We know that

P(A = 1) = 0.4,  and  P(A = 0) = 1 − P(A = 1) = 0.6.

Further, let P(M = 1 | A) denote the conditional probability of Martha going to a party conditioned on whether Alex is going to the party. P(M = 1 | A) takes the following values:

P(M = 1 | A = 1) = 0.8,  and  P(M = 1 | A = 0) = 0.3.

We can use the above values of P(M|A) and P(A) to compute the probability of Alex going to the party given Martha is going to the party, P(A = 1 | M = 1), as follows:

P(A = 1 | M = 1) = P(M = 1 | A = 1) P(A = 1) / [ P(M = 1 | A = 1) P(A = 1) + P(M = 1 | A = 0) P(A = 0) ]
                 = (0.8 × 0.4) / (0.8 × 0.4 + 0.3 × 0.6) = 0.64.   (4.13)

Notice that even though the prior probability P(A) of Alex going to the party is low, the observation that Martha is going, M = 1, affects the conditional probability P(A = 1 | M = 1). This shows the value of Bayes theorem in combining prior assumptions with observed outcomes to make predictions. Since P(A = 1 | M = 1) > 0.5, it is more likely for Alex to join if Martha is going to the party.
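The arithmetic of Example 4.4 is easy to check directly (a small sketch of our own; the function name is ours):

def bayes_posterior(prior_a, p_m_given_a, p_m_given_not_a):
    # P(A=1|M=1) = P(M=1|A=1) P(A=1) / [ P(M=1|A=1) P(A=1) + P(M=1|A=0) P(A=0) ]
    evidence = p_m_given_a * prior_a + p_m_given_not_a * (1 - prior_a)
    return p_m_given_a * prior_a / evidence

print(bayes_posterior(0.4, 0.8, 0.3))   # 0.64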

Using Bayes Theorem for Classification

For the purpose of classification, we are interested in computing the probability of observing a class label y for a data instance given its set of attribute values x. This can be represented as P(y|x), which is known as the posterior probability of the target class. Using the Bayes theorem, we can represent the posterior probability as

P(y|x) = P(x|y) P(y) / P(x).   (4.14)

Note that the numerator of the previous equation involves two terms, P(x|y) and P(y), both of which contribute to the posterior probability P(y|x). We describe both of these terms in the following.

The first term P(x|y) is known as the class-conditional probability of the attributes given the class label. P(x|y) measures the likelihood of observing x from the distribution of instances belonging to y. If x indeed belongs to class y, then we should expect P(x|y) to be high. From this point of view, the use of class-conditional probabilities attempts to capture the process from which the data instances were generated. Because of this interpretation, probabilistic classification models that involve computing class-conditional probabilities are

known as generative classification models. Apart from their use in computing posterior probabilities and making predictions, class-conditional probabilities also provide insights about the underlying mechanism behind the generation of attribute values.

The second term in the numerator of Equation 4.14 is the prior probability P(y). The prior probability captures our prior beliefs about the distribution of class labels, independent of the observed attribute values. (This is the Bayesian viewpoint.) For example, we may have a prior belief that the likelihood of any person to suffer from a heart disease is α, irrespective of their diagnosis reports. The prior probability can either be obtained using expert knowledge, or inferred from the historical distribution of class labels.

The denominator in Equation 4.14 involves the probability of evidence, P(x). Note that this term does not depend on the class label and thus can be treated as a normalization constant in the computation of posterior probabilities. Further, the value of P(x) can be calculated as P(x) = ∑_i P(x|yi) P(yi).

Bayes theorem provides a convenient way to combine our prior beliefs with the likelihood of obtaining the observed attribute values. During the training phase, we are required to learn the parameters for P(y) and P(x|y). The prior probability P(y) can be easily estimated from the training set by computing the fraction of training instances that belong to each class. To compute the class-conditional probabilities, one approach is to consider the fraction of training instances of a given class for every possible combination of attribute values. For example, suppose that there are two attributes X1 and X2 that can each take a discrete value from c1 to ck. Let n0 denote the number of training instances belonging to class 0, out of which nij0 training instances have X1 = ci and X2 = cj. The class-conditional probability can then be given as

P(X1 = ci, X2 = cj | Y = 0) = nij0 / n0.

This approach can easily become computationally prohibitive as the number of attributes increases, due to the exponential growth in the number of attribute value combinations. For example, if every attribute can take k discrete values, then the number of attribute value combinations is equal to k^d, where d is the number of attributes. The large number of attribute value combinations can also result in poor estimates of class-conditional probabilities, since every combination will have fewer training instances when the size of the training set is small.

In the following, we present the naïve Bayes classifier, which makes a simplifying assumption about the class-conditional probabilities, known as the naïve Bayes assumption. The use of this assumption significantly helps in obtaining reliable estimates of class-conditional probabilities, even when the number of attributes is large.

4.4.2 Naïve Bayes Assumption

The naïve Bayes classifier assumes that the class-conditional probability of all attributes x can be factored as a product of class-conditional probabilities of every attribute xi, as described in the following equation:

P(x|y) = ∏_{i=1}^{d} P(xi|y),   (4.15)

where every data instance x consists of d attributes, {x1, x2, …, xd}. The basic assumption behind the previous equation is that the attribute values are conditionally independent of each other, given the class label y. This means that the attributes are influenced only by the target class and if we know the class label, then we can consider the attributes to be independent of each other. The concept of conditional independence can be formally stated as follows.

Conditional Independence

Let X1, X2, and Y denote three sets of random variables. The variables in X1 are said to be conditionally independent of X2, given Y, if the following condition holds:

P(X1 | X2, Y) = P(X1 | Y).   (4.16)

This means that conditioned on Y, the distribution of X1 is not influenced by the outcomes of X2, and hence X1 is conditionally independent of X2. To illustrate the notion of conditional independence, consider the relationship between a person's arm length (X1) and his or her reading skills (X2). One might observe that people with longer arms tend to have higher levels of reading skills, and thus consider X1 and X2 to be related to each other. However, this relationship can be explained by another factor, which is the age of the person (Y). A young child tends to have short arms and lacks the reading skills of an adult. If the age of a person is fixed, then the observed relationship between arm length and reading skills disappears. Thus, we can conclude that arm length and reading skills are not directly related to each other and are conditionally independent when the age variable is fixed.

Another way of describing conditional independence is to consider the joint conditional probability, P(X1, X2 | Y), as follows:

P(X1, X2 | Y) = P(X1, X2, Y) / P(Y)
             = [ P(X1, X2, Y) / P(X2, Y) ] × [ P(X2, Y) / P(Y) ]
             = P(X1 | X2, Y) × P(X2 | Y)
             = P(X1 | Y) × P(X2 | Y),   (4.17)

where Equation 4.16 was used to obtain the last line of Equation 4.17. The previous description of conditional independence is quite useful from an operational perspective. It states that the joint conditional probability of X1 and X2 given Y can be factored as the product of conditional probabilities of X1 and X2 considered separately. This forms the basis of the naïve Bayes assumption stated in Equation 4.15.

How a Naïve Bayes Classifier Works Using the naïve Bayes assumption, we only need to estimate the conditional probability of each $x_i$ given $Y$ separately, instead of computing the class-conditional probability for every combination of attribute values. For example, if $n_{i0}$ and $n_{j0}$ denote the number of training instances belonging to class 0 with $X_1 = c_i$ and $X_2 = c_j$, respectively, then the class-conditional probability can be estimated as

$$P(X_1 = c_i, X_2 = c_j | Y = 0) = \frac{n_{i0}}{n_0} \times \frac{n_{j0}}{n_0}.$$

In the previous equation, we only need to count the number of training instances for every one of the $k$ values of an attribute $X$, irrespective of the values of other attributes. Hence, the number of parameters needed to learn class-conditional probabilities is reduced from $k^d$ to $dk$. This greatly simplifies the expression for the class-conditional probability and makes it more amenable to learning parameters and making predictions, even in high-dimensional settings.

The naïve Bayes classifier computes the posterior probability for a test instance $\mathbf{x}$ by using the following equation:

$$P(y|\mathbf{x}) = \frac{P(y)\prod_{i=1}^{d} P(x_i|y)}{P(\mathbf{x})}. \qquad (4.18)$$

Since $P(\mathbf{x})$ is fixed for every $y$ and only acts as a normalizing constant to ensure that $P(y|\mathbf{x}) \in [0, 1]$, we can write

$$P(y|\mathbf{x}) \propto P(y)\prod_{i=1}^{d} P(x_i|y).$$

Hence, it is sufficient to choose the class that maximizes $P(y)\prod_{i=1}^{d} P(x_i|y)$.

One of the useful properties of the naïve Bayes classifier is that it can easily work with incomplete information about data instances, when only a subset of attributes is observed at every instance. For example, if we only observe $p$ out of $d$ attributes at a data instance, then we can still compute $P(y)\prod_{i=1}^{p} P(x_i|y)$ using those $p$ attributes and choose the class with the maximum value. The naïve Bayes classifier can thus naturally handle missing values in test instances. In fact, in the extreme case where no attributes are observed, we can still use the prior probability $P(y)$ as an estimate of the posterior probability. As we observe more attributes, we can keep refining the posterior probability to better reflect the likelihood of observing the data instance.

In the next two subsections, we describe several approaches for estimating the conditional probabilities $P(x_i|y)$ for categorical and continuous attributes from the training set.

Estimating Conditional Probabilities for Categorical Attributes For a categorical attribute $X_i$, the conditional probability $P(X_i = c|y)$ is estimated according to the fraction of training instances in class $y$ where $X_i$ takes on a particular categorical value $c$:

$$P(X_i = c|y) = \frac{n_c}{n},$$

where $n$ is the number of training instances belonging to class $y$, out of which $n_c$ instances have $X_i = c$. For example, in the training set given in Figure 4.8, seven people have the class label Defaulted Borrower = No, out of which three people have Home Owner = Yes while the remaining four have Home Owner = No. As a result, the conditional probability $P(\text{Home Owner} = \text{Yes}|\text{Defaulted Borrower} = \text{No})$ is equal to 3/7. Similarly, the conditional probability for defaulted borrowers with Marital Status = Single is given by $P(\text{Marital Status} = \text{Single}|\text{Defaulted Borrower} = \text{Yes}) = 2/3$. Note that the sum of conditional probabilities over all possible outcomes of $X_i$ is equal to one, i.e., $\sum_c P(X_i = c|y) = 1$.

Figure 4.8. Training set for predicting the loan default problem.
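The counting-based estimates above are straightforward to implement. The following is a minimal sketch in Python, using a small hypothetical list of training records in the spirit of Figure 4.8; the variable and function names, and the toy data, are illustrative rather than taken from the book.

    from collections import defaultdict

    # Hypothetical toy training set (attribute dictionary, class label).
    train = [
        ({"HomeOwner": "Yes", "MaritalStatus": "Single"}, "No"),
        ({"HomeOwner": "No",  "MaritalStatus": "Married"}, "No"),
        ({"HomeOwner": "No",  "MaritalStatus": "Single"}, "Yes"),
        ({"HomeOwner": "No",  "MaritalStatus": "Divorced"}, "Yes"),
    ]

    # n[y] counts instances of class y; n_c[(attr, value, y)] counts class-y
    # instances that have attribute attr equal to value.
    n = defaultdict(int)
    n_c = defaultdict(int)
    for attrs, y in train:
        n[y] += 1
        for attr, value in attrs.items():
            n_c[(attr, value, y)] += 1

    def cond_prob(attr, value, y):
        """P(X_i = c | y) = n_c / n, the fraction of class-y instances with X_i = c."""
        return n_c[(attr, value, y)] / n[y]

    print(cond_prob("HomeOwner", "No", "No"))   # fraction of class-No instances with HomeOwner = No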

Estimating Conditional Probabilities for Continuous Attributes There are two ways to estimate the class-conditional probabilities $P(X_i|Y)$ for continuous attributes:

1. We can discretize each continuous attribute and then replace the continuous values with their corresponding discrete intervals. This approach transforms the continuous attributes into ordinal attributes, and the simple method described previously for computing the conditional probabilities of categorical attributes can be employed. Note that the estimation error of this method depends on the discretization strategy (as described in Section 2.3.6 on page 63), as well as the number of discrete intervals. If the number of intervals is too large, every interval may have an insufficient number of training instances to provide a reliable estimate of $P(X_i|Y)$. On the other hand, if the number of intervals is too small, then the discretization process may lose information about the true distribution of continuous values, and thus result in poor predictions.

2. We can assume a certain form of probability distribution for the continuous variable and estimate the parameters of the distribution using the training data. For example, we can use a Gaussian distribution to represent the conditional probability of continuous attributes. The Gaussian distribution is characterized by two parameters, the mean, $\mu$, and the variance, $\sigma^2$. For each class $y_j$, the class-conditional probability for attribute $X_i$ is

$$P(X_i = x_i|Y = y_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}}\exp\left[-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right]. \qquad (4.19)$$

The parameter $\mu_{ij}$ can be estimated using the sample mean of $X_i$ ($\bar{x}$) for all training instances that belong to $y_j$. Similarly, $\sigma_{ij}^2$ can be estimated from the sample variance ($s^2$) of such training instances. For example, consider the annual income attribute shown in Figure 4.8. The sample mean and variance for this attribute with respect to the class No are

$$\bar{x} = \frac{125 + 100 + 70 + \cdots + 75}{7} = 110, \qquad s^2 = \frac{(125 - 110)^2 + (100 - 110)^2 + \cdots + (75 - 110)^2}{6} = 2975, \qquad s = \sqrt{2975} = 54.54.$$

Given a test instance with taxable income equal to \$120K, we can use the following value as its conditional probability given class No:

$$P(\text{Income} = 120|\text{No}) = \frac{1}{\sqrt{2\pi}(54.54)}\exp\left[-\frac{(120 - 110)^2}{2 \times 2975}\right] = 0.0072.$$
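A minimal sketch of this Gaussian estimate in Python; the income values below are hypothetical, chosen so that they reproduce the sample mean (110) and variance (2975) quoted above, and the function name is ours.

    import math

    # Hypothetical class-No incomes (in $K) consistent with the quoted mean and variance.
    incomes_no = [125, 100, 70, 120, 60, 220, 75]

    n = len(incomes_no)
    mean = sum(incomes_no) / n                                  # 110.0
    var = sum((x - mean) ** 2 for x in incomes_no) / (n - 1)    # 2975.0 (sample variance)

    def gaussian_pdf(x, mu, sigma2):
        """Class-conditional density of Equation 4.19."""
        return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

    print(round(gaussian_pdf(120, mean, var), 4))   # approximately 0.0072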

Example 4.5. [Naïve Bayes Classifier] Consider the data set shown in Figure 4.9(a), where the target class is Defaulted Borrower, which can take two values Yes and No. We can compute the class-conditional probability for each categorical attribute and the sample mean and variance for the continuous attribute, as summarized in Figure 4.9(b).

We are interested in predicting the class label of a test instance $\mathbf{x} = (\text{Home Owner} = \text{No}, \text{Marital Status} = \text{Married}, \text{Annual Income} = \$120\text{K})$. To do this, we first compute the prior probabilities by counting the number of training instances belonging to every class. We thus obtain $P(\text{Yes}) = 0.3$ and $P(\text{No}) = 0.7$. Next, we can compute the class-conditional probabilities as follows:

$$P(\mathbf{x}|\text{No}) = P(\text{Home Owner} = \text{No}|\text{No}) \times P(\text{Status} = \text{Married}|\text{No}) \times P(\text{Annual Income} = \$120\text{K}|\text{No}) = 4/7 \times 4/7 \times 0.0072 = 0.0024,$$
$$P(\mathbf{x}|\text{Yes}) = P(\text{Home Owner} = \text{No}|\text{Yes}) \times P(\text{Status} = \text{Married}|\text{Yes}) \times P(\text{Annual Income} = \$120\text{K}|\text{Yes}) = 1 \times 0 \times 1.2 \times 10^{-9} = 0.$$


Figure 4.9. The naïve Bayes classifier for the loan classification problem.

Notice that the class-conditional probability for class Yes has become 0 because there are no instances belonging to class Yes with Status = Married in the training set. Using these class-conditional probabilities, we can estimate the posterior probabilities as

$$P(\text{No}|\mathbf{x}) = \frac{0.7 \times 0.0024}{P(\mathbf{x})} = 0.0016\,\alpha, \qquad P(\text{Yes}|\mathbf{x}) = \frac{0.3 \times 0}{P(\mathbf{x})} = 0,$$

where $\alpha = 1/P(\mathbf{x})$ is a normalizing constant. Since $P(\text{No}|\mathbf{x}) > P(\text{Yes}|\mathbf{x})$, the instance is classified as No.

Handling Zero Conditional Probabilities The preceding example illustrates a potential problem with using the naïve Bayes assumption in estimating class-conditional probabilities. If the conditional probability for any of the attributes is zero, then the entire expression for the class-conditional probability becomes zero. Note that zero conditional probabilities arise when the number of training instances is small and the number of possible values of an attribute is large. In such cases, it may happen that a combination of attribute values and class labels is never observed, resulting in a zero conditional probability.

In a more extreme case, if the training instances do not cover some combinations of attribute values and class labels, then we may not be able to classify some of the test instances at all. For example, if $P(\text{Marital Status} = \text{Divorced}|\text{No})$ is zero instead of 1/7, then a data instance with attribute set $\mathbf{x} = (\text{Home Owner} = \text{Yes}, \text{Marital Status} = \text{Divorced}, \text{Income} = \$120\text{K})$ has the following class-conditional probabilities:

$$P(\mathbf{x}|\text{No}) = 3/7 \times 0 \times 0.0072 = 0, \qquad P(\mathbf{x}|\text{Yes}) = 0 \times 1/3 \times 1.2 \times 10^{-9} = 0.$$

Since both class-conditional probabilities are 0, the naïve Bayes classifier will not be able to classify the instance. To address this problem, it is important to adjust the conditional probability estimates so that they are not as brittle as simply using fractions of training instances. This can be achieved by using the following alternate estimates of conditional probability:

$$\text{Laplace estimate: } P(X_i = c|y) = \frac{n_c + 1}{n + v}, \qquad (4.20)$$

$$\text{m-estimate: } P(X_i = c|y) = \frac{n_c + mp}{n + m}, \qquad (4.21)$$

where $n$ is the number of training instances belonging to class $y$, $n_c$ is the number of training instances with $X_i = c$ and $Y = y$, $v$ is the total number of attribute values that $X_i$ can take, $p$ is some initial estimate of $P(X_i = c|y)$ that is known a priori, and $m$ is a hyper-parameter that indicates our confidence in using $p$ when the fraction of training instances is too brittle. Note that even if $n_c = 0$, both the Laplace and m-estimate provide non-zero values of conditional probabilities. Hence, they avoid the problem of vanishing class-conditional probabilities and thus generally provide more robust estimates of posterior probabilities.
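The two smoothed estimates in Equations 4.20 and 4.21 differ only in how the pseudo-counts are chosen. A minimal sketch, with illustrative counts (the function and variable names are ours):

    def laplace_estimate(n_c, n, v):
        """Equation 4.20: add one pseudo-count per possible attribute value."""
        return (n_c + 1) / (n + v)

    def m_estimate(n_c, n, m, p):
        """Equation 4.21: blend the raw fraction n_c/n with a prior estimate p,
        weighted by the confidence hyper-parameter m."""
        return (n_c + m * p) / (n + m)

    # Example: an attribute value never seen with class y (n_c = 0) among n = 7
    # class-y instances, where the attribute can take v = 3 values.
    print(laplace_estimate(0, 7, 3))         # 0.1, instead of a brittle 0
    print(m_estimate(0, 7, m=3, p=1/3))      # 0.1, using prior p = 1/3 with weight m = 3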

Characteristics of Naïve Bayes Classifiers

1. Naïve Bayes classifiers are probabilistic classification models that are able to quantify the uncertainty in predictions by providing posterior probability estimates. They are also generative classification models as they treat the target class as the causative factor for generating the data instances. Hence, apart from computing posterior probabilities, naïve Bayes classifiers also attempt to capture the underlying mechanism behind the generation of data instances belonging to every class. They are thus useful for gaining predictive as well as descriptive insights.

2. By using the naïve Bayes assumption, they can easily compute class-conditional probabilities even in high-dimensional settings, provided that the attributes are conditionally independent of each other given the class labels. This property makes the naïve Bayes classifier a simple and effective classification technique that is commonly used in diverse application problems, such as text classification.

3. Naïve Bayes classifiers are robust to isolated noise points because such points are not able to significantly impact the conditional probability estimates, as they are often averaged out during training.


4. Naïve Bayes classifiers can handle missing values in the training set by ignoring the missing values of every attribute while computing its conditional probability estimates. Further, naïve Bayes classifiers can effectively handle missing values in a test instance by using only the non-missing attribute values while computing posterior probabilities. However, if the frequency of missing values for a particular attribute value depends on the class label, then this approach will not accurately estimate the posterior probabilities.

5. Naïve Bayes classifiers are robust to irrelevant attributes. If $X_i$ is an irrelevant attribute, then $P(X_i|Y)$ becomes almost uniformly distributed for every class $y$. The class-conditional probabilities for every class thus receive similar contributions of $P(X_i|Y)$, resulting in negligible impact on the posterior probability estimates.

6. Correlated attributes can degrade the performance of naïve Bayes classifiers because the naïve Bayes assumption of conditional independence no longer holds for such attributes. For example, consider the following probabilities:

$$P(A = 0|Y = 0) = 0.4, \quad P(A = 1|Y = 0) = 0.6,$$
$$P(A = 0|Y = 1) = 0.6, \quad P(A = 1|Y = 1) = 0.4,$$

where $A$ is a binary attribute and $Y$ is a binary class variable. Suppose there is another binary attribute $B$ that is perfectly correlated with $A$ when $Y = 0$, but is independent of $A$ when $Y = 1$. For simplicity, assume that the conditional probabilities for $B$ are the same as for $A$. Given an instance with attributes $A = 0$, $B = 0$, and assuming conditional independence, we can compute its posterior probabilities as follows:

$$P(Y = 0|A = 0, B = 0) = \frac{P(A = 0|Y = 0)\,P(B = 0|Y = 0)\,P(Y = 0)}{P(A = 0, B = 0)} = \frac{0.16 \times P(Y = 0)}{P(A = 0, B = 0)},$$
$$P(Y = 1|A = 0, B = 0) = \frac{P(A = 0|Y = 1)\,P(B = 0|Y = 1)\,P(Y = 1)}{P(A = 0, B = 0)} = \frac{0.36 \times P(Y = 1)}{P(A = 0, B = 0)}.$$

If $P(Y = 0) = P(Y = 1)$, then the naïve Bayes classifier would assign the instance to class 1. However, the truth is,

$$P(A = 0, B = 0|Y = 0) = P(A = 0|Y = 0) = 0.4,$$

because $A$ and $B$ are perfectly correlated when $Y = 0$. As a result, the posterior probability for $Y = 0$ is

$$P(Y = 0|A = 0, B = 0) = \frac{P(A = 0, B = 0|Y = 0)\,P(Y = 0)}{P(A = 0, B = 0)} = \frac{0.4 \times P(Y = 0)}{P(A = 0, B = 0)},$$

which is larger than that for $Y = 1$. The instance should have been classified as class 0. Hence, the naïve Bayes classifier can produce incorrect results when the attributes are not conditionally independent given the class labels. Naïve Bayes classifiers are thus not well-suited for handling redundant or interacting attributes.
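A short numeric check of this example, assuming equal priors; the variable names are ours:

    # Class-conditional probabilities for A (and, by assumption, also for B).
    p_a0 = {0: 0.4, 1: 0.6}    # P(A=0 | Y=y)

    # Naive Bayes score for each class at (A=0, B=0), up to the common normalizer
    # P(A=0, B=0), with equal priors P(Y=0) = P(Y=1) = 0.5.
    nb_score = {y: p_a0[y] * p_a0[y] * 0.5 for y in (0, 1)}     # {0: 0.08, 1: 0.18} -> picks class 1

    # True score: B is a copy of A when Y=0, so P(A=0,B=0|Y=0) = P(A=0|Y=0) = 0.4,
    # while A and B stay independent when Y=1.
    true_score = {0: 0.4 * 0.5, 1: 0.6 * 0.6 * 0.5}             # {0: 0.20, 1: 0.18} -> picks class 0

    print(max(nb_score, key=nb_score.get), max(true_score, key=true_score.get))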

4.5BayesianNetworksTheconditionalindependenceassumptionmadebynaïveBayesclassifiersmayseemtoorigid,especiallyforclassificationproblemswheretheattributesaredependentoneachotherevenafterconditioningontheclasslabels.WethusneedanapproachtorelaxthenaïveBayesassumptionsothatwecancapturemoregenericrepresentationsofconditionalindependenceamongattributes.

Inthissection,wepresentaflexibleframeworkformodelingprobabilisticrelationshipsbetweenattributesandclasslabels,knownasBayesianNetworks.Bybuildingonconceptsfromprobabilitytheoryandgraphtheory,Bayesiannetworksareabletocapturemoregenericformsofconditionalindependenceusingsimpleschematicrepresentations.Theyalsoprovidethenecessarycomputationalstructuretoperforminferencesoverrandomvariablesinanefficientway.Inthefollowing,wefirstdescribethebasicrepresentationofaBayesiannetwork,andthendiscussmethodsforperforminginferenceandlearningmodelparametersinthecontextofclassification.

4.5.1 Graphical Representation

Bayesiannetworksbelongtoabroaderfamilyofmodelsforcapturingprobabilisticrelationshipsamongrandomvariables,knownasprobabilisticgraphicalmodels.Thebasicconceptbehindthesemodelsistousegraphicalrepresentationswherethenodesofthegraphcorrespondtorandomvariablesandtheedgesbetweenthenodesexpressprobabilistic

relationships.Figures4.10(a) and4.10(b) showexamplesofprobabilisticgraphicalmodelsusingdirectededges(witharrows)andundirectededges(withoutarrows),respectively.DirectedgraphicalmodelsarealsoknownasBayesiannetworkswhileundirectedgraphicalmodelsareknownasMarkovrandomfields.Thetwoapproachesusedifferentsemanticsforexpressingrelationshipsamongrandomvariablesandarethususefulindifferentcontexts.Inthefollowing,webrieflydescribeBayesiannetworksthatareusefulinthecontextofclassification.

ABayesiannetwork(alsoreferredtoasabeliefnetwork)involvesdirectededgesbetweennodes,whereeveryedgerepresentsadirectionofinfluenceamongrandomvariables.Forexample,Figure4.10(a) showsaBayesiannetworkwherevariableCdependsuponthevaluesofvariablesAandB,asindicatedbythearrowspointingtowardCfromAandB.Consequently,thevariableCinfluencesthevaluesofvariablesDandE.EveryedgeinaBayesiannetworkthusencodesadependencerelationshipbetweenrandomvariableswithaparticulardirectionality.

Figure 4.10. Illustrations of two basic types of graphical models.

Bayesiannetworksaredirectedacyclicgraphs(DAG)becausetheydonotcontainanydirectedcyclessuchthattheinfluenceofanodeloopsbacktothesamenode.Figure4.11 showssomeexamplesofBayesiannetworksthatcapturedifferenttypesofdependencestructuresamongrandomvariables.Inadirectedacyclicgraph,ifthereisadirectededgefromXtoY,thenXiscalledtheparentofYandYiscalledthechildofX.NotethatanodecanhavemultipleparentsinaBayesiannetwork,e.g.,nodeDhastwoparentnodes,BandC,inFigure4.11(a) .Furthermore,ifthereisadirectedpathinthenetworkfromXtoZ,thenXisanancestorofZ,whileZisadescendantofX.Forexample,inthediagramshowninFigure4.11(b) ,AisadescendantofDandDisanancestorofB.Notethattherecanbemultipledirectedpathsbetweentwonodesofadirectedacyclicgraph,asisthecasefornodesAandDinFigure4.11(a) .

Figure 4.11. Examples of Bayesian networks.

ConditionalIndependenceAnimportantpropertyofaBayesiannetworkisitsabilitytorepresentvaryingformsofconditionalindependenceamongrandomvariables.ThereareseveralwaysofdescribingtheconditionalindependenceassumptionscapturedbyBayesiannetworks.Oneofthemostgenericwaysofexpressingconditionalindependenceistheconceptofd-separation,whichcanbeusedtodetermineifanytwosetsofnodesAandBareconditionallyindependentgivenanothersetofnodesC.AnotherusefulconceptisthatoftheMarkovblanketofanodeY,whichdenotestheminimalsetofnodesXthatmakesYindependentoftheothernodesinthegraph,whenconditionedonX.(SeeBibliographicNotesformoredetailsond-separationandMarkovblanket.)However,forthepurposeofclassification,itissufficienttodescribeasimplerexpressionofconditionalindependenceinBayesiannetworks,knownasthelocalMarkovproperty.

Property 1 (Local Markov Property). A node in a Bayesian network is conditionally independent of its non-descendants, if its parents are known.

To illustrate the local Markov property, consider the Bayesian network shown in Figure 4.11(b). We can state that $A$ is conditionally independent of both $B$ and $D$ given $C$, because $C$ is the parent of $A$ and nodes $B$ and $D$ are non-descendants of $A$. The local Markov property helps in interpreting parent-child relationships in Bayesian networks as representations of conditional probabilities. Since a node is conditionally independent of its non-descendants given its parents, the conditional independence assumptions imposed by a Bayesian network are often sparse in structure. Nonetheless, Bayesian networks are able to express a richer class of conditional independence statements among attributes and class labels than the naïve Bayes classifier. In fact, the naïve Bayes classifier can be viewed as a special type of Bayesian network, where the target class $Y$ is at the root of a tree and every attribute $X_i$ is connected to the root node by a directed edge, as shown in Figure 4.12(a).

Figure 4.12. Comparing the graphical representation of a naïve Bayes classifier with that of a generic Bayesian network.

Note that in a naïve Bayes classifier, every directed edge points from the target class to the observed attributes, suggesting that the class label is a factor behind the generation of attributes. Inferring the class label can thus be viewed as diagnosing the root cause behind the observed attributes. On the other hand, Bayesian networks provide a more generic structure of probabilistic relationships, since the target class is not required to be at the root of a tree but can appear anywhere in the graph, as shown in Figure 4.12(b). In this diagram, inferring $Y$ not only helps in diagnosing the factors influencing $X_3$ and $X_4$, but also helps in predicting the influence of $X_1$ and $X_2$.

Joint Probability The local Markov property can be used to succinctly express the joint probability of the set of random variables involved in a Bayesian network. To realize this, let us first consider a Bayesian network consisting of $d$ nodes, $X_1$ to $X_d$, where the nodes have been numbered in such a way that $X_i$ is an ancestor of $X_j$ only if $i < j$. The joint probability of $X = \{X_1, \ldots, X_d\}$ can be generically factorized using the chain rule of probability as

$$P(X) = P(X_1)\,P(X_2|X_1)\,P(X_3|X_1, X_2)\cdots P(X_d|X_1, \ldots, X_{d-1}) = \prod_{i=1}^{d} P(X_i|X_1, \ldots, X_{i-1}). \qquad (4.22)$$

By the way we have constructed the graph, note that the set $\{X_1, \ldots, X_{i-1}\}$ contains only non-descendants of $X_i$. Hence, by using the local Markov property, we can write $P(X_i|X_1, \ldots, X_{i-1})$ as $P(X_i|\text{pa}(X_i))$, where $\text{pa}(X_i)$ denotes the parents of $X_i$. The joint probability can then be represented as

$$P(X) = \prod_{i=1}^{d} P(X_i|\text{pa}(X_i)). \qquad (4.23)$$

It is thus sufficient to represent the probability of every node $X_i$ in terms of its parent nodes, $\text{pa}(X_i)$, for computing $P(X)$. This is achieved with the help of probability tables that associate every node to its parent nodes as follows:

1. The probability table for node $X_i$ contains the conditional probability values $P(X_i|\text{pa}(X_i))$ for every combination of values in $X_i$ and $\text{pa}(X_i)$.

2. If $X_i$ has no parents $(\text{pa}(X_i) = \emptyset)$, then the table contains only the prior probability $P(X_i)$.
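The factorization of Equation 4.23 makes it easy to evaluate a joint assignment once the probability tables are given. A minimal sketch, using a small hypothetical network whose structure and table values are made up for illustration:

    # Hypothetical structure: pa(A) = {}, pa(B) = {}, pa(C) = {A, B}; all variables binary.
    parents = {"A": (), "B": (), "C": ("A", "B")}

    # Probability tables: key = (value of the node, values of its parents in the order above).
    tables = {
        "A": {(1, ()): 0.3, (0, ()): 0.7},
        "B": {(1, ()): 0.6, (0, ()): 0.4},
        "C": {(1, (1, 1)): 0.9, (0, (1, 1)): 0.1,
              (1, (1, 0)): 0.5, (0, (1, 0)): 0.5,
              (1, (0, 1)): 0.4, (0, (0, 1)): 0.6,
              (1, (0, 0)): 0.1, (0, (0, 0)): 0.9},
    }

    def joint_probability(assignment):
        """Equation 4.23: product of P(X_i | pa(X_i)) over all nodes."""
        prob = 1.0
        for node, pa in parents.items():
            pa_values = tuple(assignment[p] for p in pa)
            prob *= tables[node][(assignment[node], pa_values)]
        return prob

    print(joint_probability({"A": 1, "B": 0, "C": 1}))   # 0.3 * 0.4 * 0.5 = 0.06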

Example4.6.[ProbabilityTables]Figure4.13 showsanexampleofaBayesiannetworkformodelingtherelationshipsbetweenapatient'ssymptomsandriskfactors.Theprobabilitytablesareshownatthesideofeverynodeinthefigure.Theprobabilitytablesassociatedwiththeriskfactors(ExerciseandDiet)containonlythepriorprobabilities,whereasthetablesforheartdisease,heartburn,bloodpressure,andchestpain,containtheconditionalprobabilities.

Figure 4.13. A Bayesian network for detecting heart disease and heartburn in patients.

UseofHiddenVariables

ABayesiannetworktypicallyinvolvestwotypesofvariables:observedvariablesthatareclampedtospecificobservedvalues,andunobservedvariables,whosevaluesarenotknownandneedtobeinferredfromthenetwork.Todistinguishbetweenthesetwotypesofvariables,observedvariablesaregenerallyrepresentedusingshadednodeswhileunobservedvariablesarerepresentedusingemptynodes.Figure4.14 showsanexampleofaBayesiannetworkwithobservedvariables(A,B,andE)andunobservedvariables(CandD).

Figure 4.14. Observed and unobserved variables are represented using shaded and unshaded circles, respectively.

Inthecontextofclassification,theobservedvariablescorrespondtothesetofattributesX,whilethetargetclassisrepresentedusinganunobservedvariableYthatneedstobeinferredduringtesting.However,notethatagenericBayesiannetworkmaycontainmanyotherunobservedvariablesapartfromthetargetclass,asrepresentedinFigure4.15 asthesetofvariablesH.Theseunobservedvariablesrepresenthiddenorconfoundingfactorsthataffecttheprobabilitiesofattributesandclasslabels,althoughtheyareneverdirectlyobserved.TheuseofhiddenvariablesenhancestheexpressivepowerofBayesiannetworksinrepresentingcomplexprobabilistic

relationshipsbetweenattributesandclasslabels.ThisisoneofthekeydistinguishingpropertiesofBayesiannetworksascomparedtonaïveBayesclassifiers.

4.5.2 Inference and Learning

Given the probability tables corresponding to every node in a Bayesian network, the problem of inference corresponds to computing the probabilities of different sets of random variables. In the context of classification, one of the key inference problems is to compute the probability of a target class $Y$ taking on a specific value $y$, given the set of observed attributes at a data instance, $\mathbf{x}$. This can be represented using the following conditional probability:

$$P(Y = y|\mathbf{x}) = \frac{P(y, \mathbf{x})}{P(\mathbf{x})} = \frac{P(y, \mathbf{x})}{\sum_{y'} P(y', \mathbf{x})}. \qquad (4.24)$$

The previous equation involves marginal probabilities of the form $P(y, \mathbf{x})$. They can be computed by marginalizing out the hidden variables $H$ from the joint probability as follows:

$$P(y, \mathbf{x}) = \sum_{H} P(y, \mathbf{x}, H), \qquad (4.25)$$

where the joint probability $P(y, \mathbf{x}, H)$ can be obtained by using the factorization described in Equation 4.23.

Figure 4.15. An example of a Bayesian network with four hidden variables, $H_1$ to $H_4$, three observed attributes, $X_1$ to $X_3$, and one target class $Y$.

To understand the nature of computations involved in estimating $P(y, \mathbf{x})$, consider the example Bayesian network shown in Figure 4.15, which involves a target class, $Y$, three observed attributes, $X_1$ to $X_3$, and four hidden variables, $H_1$ to $H_4$. For this network, we can express $P(y, \mathbf{x})$ as

$$P(y, \mathbf{x}) = \sum_{h_1}\sum_{h_2}\sum_{h_3}\sum_{h_4} P(y, x_1, x_2, x_3, h_1, h_2, h_3, h_4)$$
$$= \sum_{h_1}\sum_{h_2}\sum_{h_3}\sum_{h_4}\big[P(h_1)\,P(h_2)\,P(x_2)\,P(h_4)\,P(x_1|h_1, h_2) \times P(h_3|x_2, h_2)\,P(y|x_1, h_3)\,P(x_3|h_3, h_4)\big] \qquad (4.26)$$
$$= \sum_{h_1}\sum_{h_2}\sum_{h_3}\sum_{h_4} f(h_1, h_2, h_3, h_4), \qquad (4.27)$$

where $f$ is a factor that depends on the values of $h_1$ to $h_4$. In the previous simplistic expression of $P(y, \mathbf{x})$, a different summand is considered for every combination of values, $h_1$ to $h_4$, of the hidden variables, $H_1$ to $H_4$. If we assume that every variable in the network can take $k$ discrete values, then the summation has to be carried out a total number of $k^4$ times. The computational complexity of this approach is thus $O(k^4)$. Moreover, the number of computations grows exponentially with the number of hidden variables, making it difficult to use this approach with networks that have a large number of hidden variables. In the following, we present different computational techniques for efficiently performing inferences in Bayesian networks.

Variable Elimination To reduce the number of computations involved in estimating $P(y, \mathbf{x})$, let us closely examine the expressions in Equations 4.26 and 4.27. Notice that although $f(h_1, h_2, h_3, h_4)$ depends on the values of all four hidden variables, it can be decomposed as a product of several smaller factors, where every factor involves only a small number of hidden variables. For example, the factor $P(h_4)$ depends only on the value of $h_4$, and thus acts as a constant multiplicative term when summations are performed over $h_1$, $h_2$, or $h_3$. Hence, if we place $P(h_4)$ outside the summations of $h_1$ to $h_3$, we can save some repeated multiplications occurring inside every summand.

In general, we can push every summation as far inside as possible, so that the factors that do not depend on the summing variable are placed outside the summation. This will help reduce the number of wasteful computations by using smaller factors at every summation. To illustrate this process, consider the following sequence of steps for computing $P(y, \mathbf{x})$, by rearranging the order of summations in Equation 4.26:

$$P(y, \mathbf{x}) = P(x_2)\sum_{h_4} P(h_4)\sum_{h_3} P(y|x_1, h_3)\,P(x_3|h_3, h_4) \times \sum_{h_2} P(h_2)\,P(h_3|x_2, h_2)\sum_{h_1} P(h_1)\,P(x_1|h_1, h_2) \qquad (4.28)$$
$$= P(x_2)\sum_{h_4} P(h_4)\sum_{h_3} P(y|x_1, h_3)\,P(x_3|h_3, h_4) \times \sum_{h_2} P(h_2)\,P(h_3|x_2, h_2)\,f_1(h_2) \qquad (4.29)$$
$$= P(x_2)\sum_{h_4} P(h_4)\sum_{h_3} P(y|x_1, h_3)\,P(x_3|h_3, h_4)\,f_2(h_3) \qquad (4.30)$$
$$= P(x_2)\sum_{h_4} P(h_4)\,f_3(h_4), \qquad (4.31)$$

where $f_i$ represents the intermediate factor term obtained by summing out $h_i$. To check if the previous rearrangements provide any improvements in

computational efficiency, let us count the number of computations occurring at every step of the process. At the first step (Equation 4.28), we perform a summation over $h_1$ using factors that depend on $h_1$ and $h_2$. This requires considering every pair of values in $h_1$ and $h_2$, resulting in $O(k^2)$ computations. Similarly, the second step (Equation 4.29) involves summing out $h_2$ using factors of $h_2$ and $h_3$, leading to $O(k^2)$ computations. The third step (Equation 4.30) again requires $O(k^2)$ computations as it involves summing out $h_3$ over factors depending on $h_3$ and $h_4$. Finally, the fourth step (Equation 4.31) involves summing out $h_4$ using factors depending on $h_4$, resulting in $O(k)$ computations.

The overall complexity of the previous approach is thus $O(k^2)$, which is considerably smaller than the $O(k^4)$ complexity of the basic approach. Hence, by merely rearranging summations and using algebraic manipulations, we are able to improve the computational efficiency in computing $P(y, \mathbf{x})$. This procedure is known as variable elimination.

The basic concept that variable elimination exploits to reduce the number of computations is the distributive nature of multiplication over addition. For example, consider the following multiplication and addition operations:

$$a \cdot (b + c + d) = a \cdot b + a \cdot c + a \cdot d.$$

Notice that the right-hand side of the previous equation involves three multiplications and two additions, while the left-hand side involves only one multiplication and two additions, thus saving two arithmetic operations. This property is utilized by variable elimination in pushing constant terms outside the summation, such that they are multiplied only once.
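A small sketch contrasting the brute-force sum of Equation 4.27 with the rearranged sums of Equations 4.28-4.31, using randomly generated factor tables over variables with $k$ values (all names and tables are illustrative; the constant $P(x_2)$ is omitted since it factors out of every term):

    import numpy as np

    rng = np.random.default_rng(0)
    k = 3   # number of values each hidden variable can take

    # Illustrative factor tables from Equation 4.26, for fixed observed values (x1, x2, x3, y).
    p_h1, p_h2, p_h4 = rng.random(k), rng.random(k), rng.random(k)
    p_x1_given = rng.random((k, k))     # indexed [h1, h2]: P(x1 | h1, h2)
    p_h3_given = rng.random((k, k))     # indexed [h3, h2]: P(h3 | x2, h2)
    p_y_given  = rng.random(k)          # indexed [h3]:     P(y  | x1, h3)
    p_x3_given = rng.random((k, k))     # indexed [h3, h4]: P(x3 | h3, h4)

    # Brute force (Equation 4.27): k^4 summands.
    brute = sum(p_h1[h1] * p_h2[h2] * p_h4[h4] *
                p_x1_given[h1, h2] * p_h3_given[h3, h2] *
                p_y_given[h3] * p_x3_given[h3, h4]
                for h1 in range(k) for h2 in range(k)
                for h3 in range(k) for h4 in range(k))

    # Variable elimination (Equations 4.28-4.31): eliminate h1, h2, h3, h4 in turn.
    f1 = p_x1_given.T @ p_h1                  # f1(h2) = sum_h1 P(h1) P(x1|h1,h2)
    f2 = p_h3_given @ (p_h2 * f1)             # f2(h3) = sum_h2 P(h2) P(h3|x2,h2) f1(h2)
    f3 = p_x3_given.T @ (p_y_given * f2)      # f3(h4) = sum_h3 P(y|x1,h3) P(x3|h3,h4) f2(h3)
    eliminated = p_h4 @ f3                    # sum_h4 P(h4) f3(h4)

    print(np.isclose(brute, eliminated))      # True: same value, far fewer operations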

Note that the efficiency of variable elimination depends on the order of hidden variables used for performing summations. Hence, we would ideally like to find the optimal order of variables that results in the smallest number of computations. Unfortunately, finding the optimal order of summations for a generic Bayesian network is an NP-hard problem, i.e., there does not exist an efficient algorithm for finding the optimal ordering that can run in polynomial time. However, there exist efficient techniques for handling special types of Bayesian networks, e.g., those involving tree-like graphs, as described in the following.

Sum-Product Algorithm for Trees Note that in Equations 4.28 and 4.29, whenever a variable $h_i$ is eliminated during marginalization, it results in the creation of a factor $f_i$ that depends on the neighboring nodes of $h_i$. $f_i$ is then absorbed in the factors of neighboring variables and the process is repeated until all unobserved variables have been marginalized. This phenomenon of variable elimination can be viewed as transmitting a local message from the variable being marginalized to its neighboring nodes. This idea of message passing utilizes the structure of the graph for performing computations, thus making it possible to use graph-theoretic approaches for making effective inferences. The sum-product algorithm builds on the concept of message passing for computing marginal and conditional probabilities on tree-based graphs.

Figure 4.16. An example of a Bayesian network with a tree structure.

Figure 4.16 shows an example of a tree involving five variables, $X_1$ to $X_5$. A key characteristic of a tree is that every node in the tree has exactly one parent, and there is only one directed edge between any two nodes in the tree. For the purpose of illustration, let us consider the problem of estimating the marginal probability of $X_2$, $P(X_2)$. This can be obtained by marginalizing out every variable in the graph except $X_2$ and rearranging the summations to obtain the following expression:

$$P(x_2) = \sum_{x_1}\sum_{x_3}\sum_{x_4}\sum_{x_5} P(x_1)\,P(x_2|x_1)\,P(x_3|x_2)\,P(x_4|x_3)\,P(x_5|x_3)$$
$$= \underbrace{\left(\sum_{x_1} P(x_1)\,P(x_2|x_1)\right)}_{m_{12}(x_2)}\underbrace{\left(\sum_{x_3} P(x_3|x_2)\underbrace{\left(\sum_{x_4} P(x_4|x_3)\right)}_{m_{43}(x_3)}\underbrace{\left(\sum_{x_5} P(x_5|x_3)\right)}_{m_{53}(x_3)}\right)}_{m_{32}(x_2)},$$

where $m_{ij}(x_j)$ has been conveniently chosen to represent the factor of $x_j$ that is obtained by summing out $x_i$. We can view $m_{ij}(x_j)$ as a local message passed from node $x_i$ to node $x_j$, as shown using arrows in Figure 4.17(a). These local messages capture the influence of eliminating nodes on the marginal probabilities of neighboring nodes.

Before we formally describe the formula for computing $m_{ij}(x_j)$ and $P(x_j)$, we first define a potential function $\psi(\cdot)$ that is associated with every node and edge of the graph. We can define the potential of a node $X_i$ as

$$\psi(X_i) = \begin{cases} P(X_i), & \text{if } X_i \text{ is the root node}, \\ 1, & \text{otherwise}. \end{cases} \qquad (4.32)$$

Figure 4.17. Illustration of message passing in the sum-product algorithm.

Similarly, we can define the potential of an edge between nodes $X_i$ and $X_j$ (where $X_i$ is the parent of $X_j$) as

$$\psi(X_i, X_j) = P(X_j|X_i).$$

Using $\psi(X_i)$ and $\psi(X_i, X_j)$, we can represent $m_{ij}(x_j)$ using the following equation:

$$m_{ij}(x_j) = \sum_{x_i}\left(\psi(x_i)\,\psi(x_i, x_j)\prod_{k \in N(i)\setminus j} m_{ki}(x_i)\right), \qquad (4.33)$$

where $N(i)$ represents the set of neighbors of node $X_i$. The message $m_{ij}$ that is transmitted from $X_i$ to $X_j$ can thus be recursively computed using the messages incident on $X_i$ from its neighboring nodes excluding $X_j$. Note that the formula for $m_{ij}$ involves taking a sum over all possible values of $X_i$, after multiplying the factors obtained from the neighbors of $X_i$. This approach of message passing is thus called the "sum-product" algorithm. Further, since $m_{ij}$ represents a notion of "belief" propagated from $X_i$ to $X_j$, this algorithm is also known as belief propagation. The marginal probability of a node $X_i$ is then given as

$$P(x_i) = \psi(x_i)\prod_{j \in N(i)} m_{ji}(x_i). \qquad (4.34)$$

A useful property of the sum-product algorithm is that it allows the messages to be reused for computing a different marginal probability in the future. For example, if we had to compute the marginal probability for node $X_3$, we would require the following messages from its neighboring nodes: $m_{23}(x_3)$, $m_{43}(x_3)$, and $m_{53}(x_3)$. However, note that $m_{43}(x_3)$ and $m_{53}(x_3)$ have already been computed in the process of computing the marginal probability of $X_2$ and thus can be reused.

Notice that the basic operations of the sum-product algorithm resemble a message passing protocol over the edges of the network. A node sends out a message to all its neighboring nodes only after it has received incoming messages from all its neighbors. Hence, we can initialize the message passing protocol from the leaf nodes, and transmit messages till we reach the root node. We can then run a second pass of messages from the root node back to the leaf nodes. In this way, we can compute the messages for every edge in both directions, using just $O(2|E|)$ operations, where $|E|$ is the number of edges. Once we have transmitted all possible messages as shown in Figure 4.17(b), we can easily compute the marginal probability of every node in the graph using Equation 4.34.
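A minimal sketch of this message passing on the five-node tree of Figure 4.16, with binary variables; the tree structure $X_1 \to X_2 \to X_3 \to \{X_4, X_5\}$ follows the expression for $P(x_2)$ above, while the table values, names, and caching scheme are illustrative:

    import numpy as np
    from functools import lru_cache

    # Tree of Figure 4.16: X1 -> X2 -> X3 -> {X4, X5}; all variables binary.
    neighbors = {1: [2], 2: [1, 3], 3: [2, 4, 5], 4: [3], 5: [3]}
    root = 1

    p_x1 = np.array([0.6, 0.4])
    cpt = {  # cpt[(i, j)][xi, xj] = P(Xj = xj | Xi = xi) for parent i and child j
        (1, 2): np.array([[0.7, 0.3], [0.2, 0.8]]),
        (2, 3): np.array([[0.9, 0.1], [0.4, 0.6]]),
        (3, 4): np.array([[0.5, 0.5], [0.3, 0.7]]),
        (3, 5): np.array([[0.8, 0.2], [0.1, 0.9]]),
    }

    def node_potential(i):
        return p_x1 if i == root else np.ones(2)                       # Equation 4.32

    def edge_potential(i, j):
        """psi(Xi, Xj) as a matrix indexed [x_i, x_j]."""
        return cpt[(i, j)] if (i, j) in cpt else cpt[(j, i)].T

    @lru_cache(maxsize=None)
    def message(i, j):
        """m_ij(x_j) of Equation 4.33, computed recursively and cached for reuse."""
        incoming = np.ones(2)
        for k in neighbors[i]:
            if k != j:
                incoming = incoming * message(k, i)
        return edge_potential(i, j).T @ (node_potential(i) * incoming)

    def marginal(i):
        """P(x_i) of Equation 4.34."""
        p = node_potential(i)
        for j in neighbors[i]:
            p = p * message(j, i)
        return p

    print(marginal(2))   # marginal distribution of X2; the two entries sum to 1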

In the context of classification, the sum-product algorithm can be easily modified for computing the conditional probability of the class label $y$ given the set of observed attributes $\hat{\mathbf{x}}$, i.e., $P(y|\hat{\mathbf{x}})$. This basically amounts to computing $P(y, X = \hat{\mathbf{x}})$ in Equation 4.24, where $X$ is clamped to the observed values $\hat{\mathbf{x}}$. To handle the scenario where some of the random variables are fixed and do not need to be marginalized, we consider the following modification. If $X_i$ is a random variable that is fixed to a specific value $\hat{x}_i$, then we can simply modify $\psi(X_i)$ and $\psi(X_i, X_j)$ as follows:

$$\psi(X_i) = \begin{cases} 1, & \text{if } X_i = \hat{x}_i, \\ 0, & \text{otherwise}, \end{cases} \qquad (4.35)$$

$$\psi(X_i, X_j) = \begin{cases} P(X_j|\hat{x}_i), & \text{if } X_i = \hat{x}_i, \\ 0, & \text{otherwise}. \end{cases} \qquad (4.36)$$

We can run the sum-product algorithm using these modified values for every observed variable and thus compute $P(y, X = \hat{\mathbf{x}})$.

Figure 4.18. Example of a poly-tree and its corresponding factor graph.

GeneralizationsforNon-TreeGraphsThesum-productalgorithmisguaranteedtooptimallyconvergeinthecaseoftreesusingasinglerunofmessagepassinginbothdirectionsofeveryedge.Thisisbecauseanytwonodesinatreehaveauniquepathforthetransmissionofmessages.Furthermore,sinceeverynodeinatreehasasingleparent,thejointprobabilityinvolvesonlyfactorsofatmosttwovariables.Hence,itissufficienttoconsiderpotentialsoveredgesandnotothergenericsubstructuresinthegraph.

Both of the previous properties are violated in graphs that are not trees, thus making it difficult to directly apply the sum-product algorithm for making inferences. However, a number of variants of the sum-product algorithm have been devised to perform inferences on a broader family of graphs than trees. Many of these variants transform the original graph into an alternative tree-based representation, and then apply the sum-product algorithm on the transformed tree. In this section, we briefly discuss one such transformation, known as factor graphs.

Factorgraphsareusefulformakinginferencesovergraphsthatviolatetheconditionthateverynodehasasingleparent.Nonetheless,theystillrequiretheabsenceofmultiplepathsbetweenanytwonodes,toguaranteeconvergence.Suchgraphsareknownaspoly-trees.Anexampleofapoly-treeisshowninFigure4.18(a) .

A poly-tree can be transformed into a tree-based representation with the help of factor graphs. These graphs consist of two types of nodes, variable nodes (represented using circles) and factor nodes (represented using squares). The factor nodes represent conditional independence relationships among the variables of the poly-tree. In particular, every probability table can be represented as a factor node. The edges in a factor graph are undirected in nature and relate a variable node to a factor node if the variable is involved in the probability table corresponding to the factor node. Figure 4.18(b) presents the factor graph representation of the poly-tree shown in Figure 4.18(a).

Notethatthefactorgraphofapoly-treealwaysformsatree-likestructure,wherethereisauniquepathofinfluencebetweenanytwonodesinthefactorgraph.Hence,wecanapplyamodifiedformofsum-productalgorithmtotransmitmessagesbetweenvariablenodesandfactornodes,whichisguaranteedtoconvergetooptimalvalues.

Learning Model Parameters In all our previous discussions on Bayesian networks, we had assumed that the topology of the Bayesian network and the values in the probability tables of every node were already known. In this section, we discuss approaches for learning both the topology and the probability table values of a Bayesian network from the training data.

Let us first consider the case where the topology of the network is known and we are only required to compute the probability tables. If there are no unobserved variables in the training data, then we can easily compute the probability table for $P(X_i|\text{pa}(X_i))$ by counting the fraction of training instances for every value of $X_i$ and every combination of values in $\text{pa}(X_i)$. However, if there are unobserved variables in $X_i$ or $\text{pa}(X_i)$, then computing the fraction of training instances for such variables is non-trivial and requires the use of advanced techniques such as the Expectation-Maximization algorithm (described later in Chapter 8).

Learning the structure of the Bayesian network is a much more challenging task than learning the probability tables. Although there are some scoring approaches that attempt to find a graph structure that maximizes the training likelihood, they are often computationally infeasible when the graph is large. Hence, a common approach for constructing Bayesian networks is to use the subjective knowledge of domain experts.

4.5.3 Characteristics of Bayesian Networks

1. Bayesiannetworksprovideapowerfulapproachforrepresentingprobabilisticrelationshipsbetweenattributesandclasslabelswiththehelpofgraphicalmodels.Theyareabletocapturecomplexformsofdependenciesamongvariables.Apartfromencodingpriorbeliefs,theyarealsoabletomodelthepresenceoflatent(unobserved)factorsashiddenvariablesinthegraph.Bayesiannetworksarethusquiteexpressiveandprovidepredictiveaswellasdescriptiveinsightsaboutthebehaviorofattributesandclasslabels.

2. Bayesiannetworkscaneasilyhandlethepresenceofcorrelatedorredundantattributes,asopposedtothenaïveBayesclassifier.ThisisbecauseBayesiannetworksdonotusethenaïveBayesassumptionaboutconditionalindependence,butinsteadareabletoexpressricherformsofconditionalindependence.

3. Similar to the naïve Bayes classifier, Bayesian networks are also quite robust to the presence of noise in the training data. Further, they can handle missing values during training as well as testing. If a test instance contains an attribute $X_i$ with a missing value, then a Bayesian network can perform inference by treating $X_i$ as an unobserved node and marginalizing out its effect on the target class. Hence, Bayesian networks are well-suited for handling incompleteness in the data, and can work with partial information. However, unless the pattern with which missing values occur is completely random, their presence will likely introduce some degree of error and/or bias into the analysis.

4. Bayesian networks are robust to irrelevant attributes that contain no discriminatory information about the class labels. Such attributes have no impact on the conditional probability of the target class, and are thus rightfully ignored.

5. LearningthestructureofaBayesiannetworkisacumbersometaskthatoftenrequiresassistancefromexpertknowledge.However,oncethestructurehasbeendecided,learningtheparametersofthenetworkcanbequitestraightforward,especiallyifallthevariablesinthenetworkareobserved.

6. Duetoitsadditionalabilityofrepresentingcomplexformsofrelationships,BayesiannetworksaremoresusceptibletooverfittingascomparedtothenaïveBayesclassifier.Furthermore,BayesiannetworkstypicallyrequiremoretraininginstancesforeffectivelylearningtheprobabilitytablesthanthenaïveBayesclassifier.

7. Although the sum-product algorithm provides computationally efficient techniques for performing inference over tree-like graphs, the complexity of the approach increases significantly when dealing with generic graphs of large sizes. In situations where exact inference is computationally infeasible, it is quite common to use approximate inference techniques.

4.6 Logistic Regression The naïve Bayes and the Bayesian network classifiers described in the previous sections provide different ways of estimating the conditional probability of an instance $\mathbf{x}$ given class $y$, $P(\mathbf{x}|y)$. Such models are known as probabilistic generative models. Note that the conditional probability $P(\mathbf{x}|y)$ essentially describes the behavior of instances in the attribute space that are generated from class $y$. However, for the purpose of making predictions, we are finally interested in computing the posterior probability $P(y|\mathbf{x})$. For example, computing the following ratio of posterior probabilities is sufficient for inferring class labels in a binary classification problem:

$$\frac{P(y = 1|\mathbf{x})}{P(y = 0|\mathbf{x})}.$$

This ratio is known as the odds. If this ratio is greater than 1, then $\mathbf{x}$ is classified as $y = 1$. Otherwise, it is assigned to class $y = 0$. Hence, one may simply learn a model of the odds based on the attribute values of training instances, without having to compute $P(\mathbf{x}|y)$ as an intermediate quantity in the Bayes theorem.

Classification models that directly assign class labels without computing class-conditional probabilities are called discriminative models. In this section, we present a probabilistic discriminative model known as logistic regression, which directly estimates the odds of a data instance $\mathbf{x}$ using its attribute values. The basic idea of logistic regression is to use a linear predictor, $z = \mathbf{w}^T\mathbf{x} + b$, for representing the odds of $\mathbf{x}$ as follows:

$$\frac{P(y = 1|\mathbf{x})}{P(y = 0|\mathbf{x})} = e^{z} = e^{\mathbf{w}^T\mathbf{x} + b}, \qquad (4.37)$$

where $\mathbf{w}$ and $b$ are the parameters of the model and $\mathbf{a}^T$ denotes the transpose of a vector $\mathbf{a}$. Note that if $\mathbf{w}^T\mathbf{x} + b > 0$, then $\mathbf{x}$ belongs to class 1 since its odds is greater than 1. Otherwise, $\mathbf{x}$ belongs to class 0.

Figure 4.19. Plot of the sigmoid (logistic) function, $\sigma(z)$.

Since $P(y = 0|\mathbf{x}) + P(y = 1|\mathbf{x}) = 1$, we can re-write Equation 4.37 as

$$\frac{P(y = 1|\mathbf{x})}{1 - P(y = 1|\mathbf{x})} = e^{z}.$$

This can be further simplified to express $P(y = 1|\mathbf{x})$ as a function of $z$:

$$P(y = 1|\mathbf{x}) = \frac{1}{1 + e^{-z}} = \sigma(z), \qquad (4.38)$$

where the function $\sigma(\cdot)$ is known as the logistic or sigmoid function. Figure 4.19 shows the behavior of the sigmoid function as we vary $z$. We can see that $\sigma(z) \geq 0.5$ only when $z \geq 0$. We can also derive $P(y = 0|\mathbf{x})$ using $\sigma(z)$ as follows:

$$P(y = 0|\mathbf{x}) = 1 - \sigma(z) = \frac{1}{1 + e^{z}}. \qquad (4.39)$$
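A minimal sketch of Equations 4.37-4.39 in Python, with made-up parameter values:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # Illustrative parameters for a two-attribute problem (not learned from data).
    w = [1.5, -0.8]
    b = 0.2

    def posterior(x):
        """Return (P(y=1|x), P(y=0|x)) from the linear predictor z = w^T x + b."""
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        p1 = sigmoid(z)                  # Equation 4.38
        return p1, 1.0 - p1              # Equation 4.39

    p1, p0 = posterior([2.0, 1.0])
    print(p1, p0, "class 1" if p1 / p0 > 1 else "class 0")   # odds test of Equation 4.37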

Hence, if we have learned suitable values of the parameters $\mathbf{w}$ and $b$, we can use Equations 4.38 and 4.39 to estimate the posterior probabilities of any data instance $\mathbf{x}$ and determine its class label.

4.6.1 Logistic Regression as a Generalized Linear Model

Since the posterior probabilities are real-valued, their estimation using the previous equations can be viewed as solving a regression problem. In fact, logistic regression belongs to a broader family of statistical regression models, known as generalized linear models (GLM). In these models, the target variable $y$ is considered to be generated from a probability distribution $P(y|\mathbf{x})$, whose mean $\mu$ can be estimated using a link function $g(\cdot)$ as follows:

$$g(\mu) = z = \mathbf{w}^T\mathbf{x} + b. \qquad (4.40)$$

For binary classification using logistic regression, $y$ follows a Bernoulli distribution ($y$ can either be 0 or 1) and $\mu$ is equal to $P(y = 1|\mathbf{x})$. The link function $g(\cdot)$ of logistic regression, called the logit function, can thus be represented as

$$g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right).$$

Depending on the choice of link function $g(\cdot)$ and the form of probability distribution $P(y|\mathbf{x})$, GLMs are able to represent a broad family of regression models, such as linear regression and Poisson regression. They require different approaches for estimating their model parameters, $(\mathbf{w}, b)$. In this chapter, we will only discuss approaches for estimating the model parameters of logistic regression, although methods for estimating parameters of other types of GLMs are often similar (and sometimes even simpler). (See Bibliographic Notes for more details on GLMs.)

Note that even though logistic regression has relationships with regression models, it is a classification model since the computed posterior probabilities are eventually used to determine the class label of a data instance.

4.6.2 Learning Model Parameters

The parameters of logistic regression, $(\mathbf{w}, b)$, are estimated during training using a statistical approach known as the maximum likelihood estimation (MLE) method. This method involves computing the likelihood of observing the training data given $(\mathbf{w}, b)$, and then determining the model parameters $(\mathbf{w}^*, b^*)$ that yield maximum likelihood.

Let $D.\text{train} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ denote a set of $n$ training instances, where $y_i$ is a binary variable (0 or 1). For a given training instance $\mathbf{x}_i$, we can compute its posterior probabilities using Equations 4.38 and 4.39. We can then express the likelihood of observing $y_i$ given $\mathbf{x}_i$, $\mathbf{w}$, and $b$ as

$$P(y_i|\mathbf{x}_i, \mathbf{w}, b) = P(y = 1|\mathbf{x}_i)^{y_i} \times P(y = 0|\mathbf{x}_i)^{1 - y_i} = (\sigma(z_i))^{y_i} \times (1 - \sigma(z_i))^{1 - y_i} = (\sigma(\mathbf{w}^T\mathbf{x}_i + b))^{y_i} \times (1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b))^{1 - y_i}, \qquad (4.41)$$

where $\sigma(\cdot)$ is the sigmoid function as described above. Equation 4.41 basically means that the likelihood $P(y_i|\mathbf{x}_i, \mathbf{w}, b)$ is equal to $P(y = 1|\mathbf{x}_i)$ when $y_i = 1$, and equal to $P(y = 0|\mathbf{x}_i)$ when $y_i = 0$. The likelihood of all training instances, $L(\mathbf{w}, b)$, can then be computed by taking the product of individual likelihoods (assuming independence among training instances) as follows:

$$L(\mathbf{w}, b) = \prod_{i=1}^{n} P(y_i|\mathbf{x}_i, \mathbf{w}, b) = \prod_{i=1}^{n} P(y = 1|\mathbf{x}_i)^{y_i} \times P(y = 0|\mathbf{x}_i)^{1 - y_i}. \qquad (4.42)$$

The previous equation involves multiplying a large number of probability values, each of which is smaller than or equal to 1. Since this naïve computation can easily become numerically unstable when $n$ is large, a more practical approach is to consider the negative logarithm (to base $e$) of the likelihood function, also known as the cross entropy function:

$$-\log L(\mathbf{w}, b) = -\sum_{i=1}^{n} y_i\log(P(y = 1|\mathbf{x}_i)) + (1 - y_i)\log(P(y = 0|\mathbf{x}_i)) = -\sum_{i=1}^{n} y_i\log(\sigma(\mathbf{w}^T\mathbf{x}_i + b)) + (1 - y_i)\log(1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)).$$

The cross entropy is a loss function that measures how unlikely it is for the training data to be generated from the logistic regression model with parameters $(\mathbf{w}, b)$. Intuitively, we would like to find model parameters $(\mathbf{w}^*, b^*)$ that result in the lowest cross entropy, $-\log L(\mathbf{w}^*, b^*)$:

$$(\mathbf{w}^*, b^*) = \operatorname*{arg\,min}_{(\mathbf{w}, b)} E(\mathbf{w}, b) = \operatorname*{arg\,min}_{(\mathbf{w}, b)} -\log L(\mathbf{w}, b), \qquad (4.43)$$

where $E(\mathbf{w}, b) = -\log L(\mathbf{w}, b)$ is the loss function. It is worth emphasizing that $E(\mathbf{w}, b)$ is a convex function, i.e., any minimum of $E(\mathbf{w}, b)$ will be a global minimum. Hence, we can use any of the standard convex optimization techniques to solve Equation 4.43, which are mentioned in Appendix E. Here, we briefly describe the Newton-Raphson method that is commonly used for estimating the parameters of logistic regression. For ease of representation, we will use a single vector $\tilde{\mathbf{w}} = (\mathbf{w}^T\ b)^T$ to describe $\mathbf{w}$ and $b$, which is of size one greater than $\mathbf{w}$. Similarly, we will consider the concatenated feature vector $\tilde{\mathbf{x}} = (\mathbf{x}^T\ 1)^T$, such that the linear predictor $z = \mathbf{w}^T\mathbf{x} + b$ can be succinctly written as $z = \tilde{\mathbf{w}}^T\tilde{\mathbf{x}}$. Also, the concatenation of all training labels, $y_1$ to $y_n$, will be represented as $\mathbf{y}$, the set consisting of $\sigma(z_1)$ to $\sigma(z_n)$ will be represented as $\boldsymbol{\sigma}$, and the concatenation of $\tilde{\mathbf{x}}_1$ to $\tilde{\mathbf{x}}_n$ will be represented as $\tilde{X}$.
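A minimal sketch of the cross entropy loss of Equation 4.43, using the augmented vectors $\tilde{\mathbf{w}}$ and $\tilde{\mathbf{x}}$ described above (the toy data is illustrative only):

    import numpy as np

    def cross_entropy(w_tilde, X_tilde, y):
        """E(w, b) = -log L(w, b): negative log-likelihood of Equation 4.43."""
        sigma = 1.0 / (1.0 + np.exp(-X_tilde @ w_tilde))   # sigma(z_i) for every instance
        return -np.sum(y * np.log(sigma) + (1 - y) * np.log(1 - sigma))

    # Toy training set: two attributes plus the appended constant 1 in each row of X_tilde.
    X_tilde = np.array([[0.5, 1.2, 1.0],
                        [1.5, -0.3, 1.0],
                        [-0.7, 0.8, 1.0]])
    y = np.array([1, 1, 0])
    print(cross_entropy(np.zeros(3), X_tilde, y))   # loss at w = 0, b = 0: 3*log(2) ~ 2.079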

The Newton-Raphson method is an iterative method for finding $\tilde{\mathbf{w}}^*$ that uses the following equation to update the model parameters at every iteration:

$$\tilde{\mathbf{w}}^{(\text{new})} = \tilde{\mathbf{w}}^{(\text{old})} - H^{-1}\nabla E(\tilde{\mathbf{w}}), \qquad (4.44)$$

where $\nabla E(\tilde{\mathbf{w}})$ and $H$ are the first- and second-order derivatives of the loss function $E(\tilde{\mathbf{w}})$ with respect to $\tilde{\mathbf{w}}$, respectively. The key intuition behind Equation 4.44 is to move the model parameters in the direction of maximum gradient, such that $\tilde{\mathbf{w}}$ takes larger steps when $\nabla E(\tilde{\mathbf{w}})$ is large. When $\tilde{\mathbf{w}}$ arrives at a minimum after some number of iterations, $\nabla E(\tilde{\mathbf{w}})$ becomes equal to 0, resulting in convergence. Hence, we start with some initial values of $\tilde{\mathbf{w}}$ (either randomly assigned or set to 0) and use Equation 4.44 to iteratively update $\tilde{\mathbf{w}}$ till there are no significant changes in its value (beyond a certain threshold).

The first-order derivative of $E(\tilde{\mathbf{w}})$ is given by

$$\nabla E(\tilde{\mathbf{w}}) = -\sum_{i=1}^{n} y_i\tilde{\mathbf{x}}_i(1 - \sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i)) - (1 - y_i)\tilde{\mathbf{x}}_i\sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i) = \sum_{i=1}^{n}(\sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i) - y_i)\tilde{\mathbf{x}}_i = \tilde{X}^T(\boldsymbol{\sigma} - \mathbf{y}), \qquad (4.45)$$

where we have used the fact that $d\sigma(z)/dz = \sigma(z)(1 - \sigma(z))$. Using $\nabla E(\tilde{\mathbf{w}})$, we can compute the second-order derivative of $E(\tilde{\mathbf{w}})$ as

$$H = \nabla\nabla E(\tilde{\mathbf{w}}) = \sum_{i=1}^{n}\sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i)(1 - \sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i))\,\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^T = \tilde{X}^T R\tilde{X}, \qquad (4.46)$$

where $R$ is a diagonal matrix whose $i$th diagonal element is $R_{ii} = \sigma_i(1 - \sigma_i)$. We can now use the first- and second-order derivatives of $E(\tilde{\mathbf{w}})$ in Equation 4.44 to obtain the following update equation at the $k$th iteration:

$$\tilde{\mathbf{w}}^{(k+1)} = \tilde{\mathbf{w}}^{(k)} - (\tilde{X}^T R_k\tilde{X})^{-1}\tilde{X}^T(\boldsymbol{\sigma}_k - \mathbf{y}), \qquad (4.47)$$

where the subscript $k$ under $R_k$ and $\boldsymbol{\sigma}_k$ refers to using $\tilde{\mathbf{w}}^{(k)}$ to compute both terms.
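A minimal sketch of the Newton-Raphson update of Equation 4.47 (often called iteratively reweighted least squares); the toy data and stopping rule are illustrative:

    import numpy as np

    def logistic_newton_raphson(X_tilde, y, n_iter=25, tol=1e-8):
        """Iterate Equation 4.47 starting from w_tilde = 0."""
        w = np.zeros(X_tilde.shape[1])
        for _ in range(n_iter):
            sigma = 1.0 / (1.0 + np.exp(-X_tilde @ w))     # sigma_k
            R = np.diag(sigma * (1.0 - sigma))             # R_k
            grad = X_tilde.T @ (sigma - y)                 # Equation 4.45
            H = X_tilde.T @ R @ X_tilde                    # Equation 4.46
            step = np.linalg.solve(H, grad)                # H^{-1} grad without an explicit inverse
            w = w - step
            if np.max(np.abs(step)) < tol:                 # no significant change -> converged
                break
        return w

    # Toy one-attribute data (last column of 1s plays the role of the bias term b).
    X_tilde = np.array([[0.2, 1.0], [1.0, 1.0], [1.8, 1.0], [3.0, 1.0]])
    y = np.array([0, 1, 0, 1])
    print(logistic_newton_raphson(X_tilde, y))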

4.6.3 Characteristics of Logistic Regression

1. Logistic regression is a discriminative model for classification that directly computes the posterior probabilities without making any assumption about the class-conditional probabilities. Hence, it is quite generic and can be applied in diverse applications. It can also be easily extended to multiclass classification, where it is known as multinomial logistic regression. However, its expressive power is limited to learning only linear decision boundaries.

2. Because there are different weights (parameters) for every attribute, the learned parameters of logistic regression can be analyzed to understand the relationships between attributes and class labels.

3. Because logistic regression does not involve computing densities and distances in the attribute space, it can work more robustly even in high-dimensional settings than distance-based methods such as nearest neighbor classifiers. However, the objective function of logistic regression does not involve any term relating to the complexity of the model. Hence, logistic regression does not provide a way to make a trade-off between model complexity and training performance, as compared to other classification models such as support vector machines. Nevertheless, variants of logistic regression can easily be developed to account for model complexity, by including appropriate terms in the objective function along with the cross entropy function.

4. Logisticregressioncanhandleirrelevantattributesbylearningweightparameterscloseto0forattributesthatdonotprovideanygaininperformanceduringtraining.Itcanalsohandleinteractingattributessincethelearningofmodelparametersisachievedinajointfashionbyconsideringtheeffectsofallattributestogether.Furthermore,ifthereareredundantattributesthatareduplicatesofeachother,thenlogisticregressioncanlearnequalweightsforeveryredundantattribute,withoutdegradingclassificationperformance.However,thepresenceofalargenumberofirrelevantorredundantattributesinhigh-dimensionalsettingscanmakelogisticregressionsusceptibletomodeloverfitting.

5. Logisticregressioncannothandledatainstanceswithmissingvalues,sincetheposteriorprobabilitiesareonlycomputedbytakingaweightedsumofalltheattributes.Iftherearemissingvaluesinatraininginstance,itcanbediscardedfromthetrainingset.However,iftherearemissingvaluesinatestinstance,thenlogisticregressionwouldfailtopredictitsclasslabel.

4.7ArtificialNeuralNetwork(ANN)Artificialneuralnetworks(ANN)arepowerfulclassificationmodelsthatareabletolearnhighlycomplexandnonlineardecisionboundariespurelyfromthedata.Theyhavegainedwidespreadacceptanceinseveralapplicationssuchasvision,speech,andlanguageprocessing,wheretheyhavebeenrepeatedlyshowntooutperformotherclassificationmodels(andinsomecasesevenhumanperformance).Historically,thestudyofartificialneuralnetworkswasinspiredbyattemptstoemulatebiologicalneuralsystems.Thehumanbrainconsistsprimarilyofnervecellscalledneurons,linkedtogetherwithotherneuronsviastrandsoffibercalledaxons.Wheneveraneuronisstimulated(e.g.,inresponsetoastimuli),ittransmitsnerveactivationsviaaxonstootherneurons.Thereceptorneuronscollectthesenerveactivationsusingstructurescalleddendrites,whichareextensionsfromthecellbodyoftheneuron.Thestrengthofthecontactpointbetweenadendriteandanaxon,knownasasynapse,determinestheconnectivitybetweenneurons.Neuroscientistshavediscoveredthatthehumanbrainlearnsbychangingthestrengthofthesynapticconnectionbetweenneuronsuponrepeatedstimulationbythesameimpulse.

Thehumanbrainconsistsofapproximately100billionneuronsthatareinter-connectedincomplexways,makingitpossibleforustolearnnewtasksandperformregularactivities.Notethatasingleneurononlyperformsasimplemodularfunction,whichistorespondtothenerveactivationscomingfromsenderneuronsconnectedatitsdendrite,andtransmititsactivationtoreceptorneuronsviaaxons.However,itisthecompositionofthesesimplefunctionsthattogetherisabletoexpresscomplexfunctions.Thisideaisatthebasisofconstructingartificialneuralnetworks.

Analogoustothestructureofahumanbrain,anartificialneuralnetworkiscomposedofanumberofprocessingunits,callednodes,thatareconnectedwitheachotherviadirectedlinks.Thenodescorrespondtoneuronsthatperformthebasicunitsofcomputation,whilethedirectedlinkscorrespondtoconnectionsbetweenneurons,consistingofaxonsanddendrites.Further,theweightofadirectedlinkbetweentwoneuronsrepresentsthestrengthofthesynapticconnectionbetweenneurons.Asinbiologicalneuralsystems,theprimaryobjectiveofANNistoadapttheweightsofthelinksuntiltheyfittheinput-outputrelationshipsoftheunderlyingdata.

ThebasicmotivationbehindusinganANNmodelistoextractusefulfeaturesfromtheoriginalattributesthataremostrelevantforclassification.Traditionally,featureextractionhasbeenachievedbyusingdimensionalityreductiontechniquessuchasPCA(introducedinChapter2),whichshowlimitedsuccessinextractingnonlinearfeatures,orbyusinghand-craftedfeaturesprovidedbydomainexperts.Byusingacomplexcombinationofinter-connectednodes,ANNmodelsareabletoextractmuchrichersetsoffeatures,resultingingoodclassificationperformance.Moreover,ANNmodelsprovideanaturalwayofrepresentingfeaturesatmultiplelevelsofabstraction,wherecomplexfeaturesareseenascompositionsofsimplerfeatures.Inmanyclassificationproblems,modelingsuchahierarchyoffeaturesturnsouttobeveryuseful.Forexample,inordertodetectahumanfaceinanimage,wecanfirstidentifylow-levelfeaturessuchassharpedgeswithdifferentgradientsandorientations.Thesefeaturescanthenbecombinedtoidentifyfacialpartssuchaseyes,nose,ears,andlips.Finally,anappropriatearrangementoffacialpartscanbeusedtocorrectlyidentifyahumanface.ANNmodelsprovideapowerfularchitecturetorepresentahierarchicalabstractionoffeatures,fromlowerlevelsofabstraction(e.g.,edges)tohigherlevels(e.g.,facialparts).

Artificialneuralnetworkshavehadalonghistoryofdevelopmentsspanningoverfivedecadesofresearch.AlthoughclassicalmodelsofANNsufferedfromseveralchallengesthathinderedprogressforalongtime,theyhavere-emergedwithwidespreadpopularitybecauseofanumberofrecentdevelopmentsinthelastdecade,collectivelyknownasdeeplearning.Inthissection,weexamineclassicalapproachesforlearningANNmodels,startingfromthesimplestmodelcalledperceptronstomorecomplexarchitecturescalledmulti-layerneuralnetworks.Inthenextsection,wediscusssomeoftherecentadvancementsintheareaofANNthathavemadeitpossibletoeffectivelylearnmodernANNmodelswithdeeparchitectures.

4.7.1 Perceptron

A perceptron is a basic type of ANN model that involves two types of nodes: input nodes, which are used to represent the input attributes, and an output node, which is used to represent the model output. Figure 4.20 illustrates the basic architecture of a perceptron that takes three input attributes, $x_1$, $x_2$, and $x_3$, and produces a binary output $y$. The input node corresponding to an attribute $x_i$ is connected via a weighted link $w_i$ to the output node. The weighted link is used to emulate the strength of a synaptic connection between neurons.

Figure 4.20. Basic architecture of a perceptron.

The output node is a mathematical device that computes a weighted sum of its inputs, adds a bias factor $b$ to the sum, and then examines the sign of the result to produce the output $\hat{y}$ as follows:

$$\hat{y} = \begin{cases} 1, & \text{if } \mathbf{w}^T\mathbf{x} + b > 0, \\ -1, & \text{otherwise}. \end{cases} \qquad (4.48)$$

To simplify notation, $\mathbf{w}$ and $b$ can be concatenated to form $\tilde{\mathbf{w}} = (\mathbf{w}^T\ b)^T$, while $\mathbf{x}$ can be appended with a 1 at the end to form $\tilde{\mathbf{x}} = (\mathbf{x}^T\ 1)^T$. The output of the perceptron $\hat{y}$ can then be written as

$$\hat{y} = \text{sign}(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}),$$

where the sign function acts as an activation function by providing an output value of $+1$ if the argument is positive and $-1$ if its argument is negative.

Learning the Perceptron Given a training set, we are interested in learning parameters $\tilde{\mathbf{w}}$ such that $\hat{y}$ closely resembles the true $y$ of the training instances. This is achieved by using the perceptron learning algorithm given in Algorithm 4.3. The key computation for this algorithm is the iterative weight update formula given in Step 8 of the algorithm:

$$w_j^{(k+1)} = w_j^{(k)} + \lambda(y_i - \hat{y}_i^{(k)})x_{ij}, \qquad (4.49)$$

where $w_j^{(k)}$ is the weight parameter associated with the $j$th input link after the $k$th iteration, $\lambda$ is a parameter known as the learning rate, and $x_{ij}$ is the value of the $j$th attribute of the training example $\mathbf{x}_i$. The justification for Equation 4.49 is rather intuitive. Note that $(y_i - \hat{y}_i)$ captures the discrepancy between $y_i$ and $\hat{y}_i$, such that its value is 0 only when the true label and the predicted output match. Assume $x_{ij}$ is positive. If $\hat{y} = -1$ and $y = +1$, then $w_j$ is increased at the next iteration so that $\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i$ can become positive. On the other hand, if $\hat{y} = +1$ and $y = -1$, then $w_j$ is decreased so that $\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i$ can become negative. Hence, the weights are modified at every iteration to reduce the discrepancies between $\hat{y}$ and $y$ across all training instances. The learning rate $\lambda$, a parameter whose value is between 0 and 1, can be used to control the amount of adjustment made in each iteration. The algorithm halts when the average number of discrepancies is smaller than a threshold $\gamma$.

Algorithm 4.3 Perceptron learning algorithm.
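A minimal Python sketch of a perceptron training loop built around the weight update of Equation 4.49; the data, learning rate $\lambda$, and stopping threshold $\gamma$ are illustrative:

    import numpy as np

    def train_perceptron(X, y, lam=0.1, gamma=0.0, max_epochs=100):
        """Perceptron learning with the update of Equation 4.49.
        X: n x d attribute matrix; y: labels in {-1, +1}."""
        X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])   # append a 1 for the bias term
        w = np.zeros(X_tilde.shape[1])
        for _ in range(max_epochs):
            errors = 0
            for x_i, y_i in zip(X_tilde, y):
                y_hat = 1 if w @ x_i > 0 else -1               # Equation 4.48
                if y_hat != y_i:
                    w = w + lam * (y_i - y_hat) * x_i          # Equation 4.49
                    errors += 1
            if errors / len(y) <= gamma:                       # average discrepancy below threshold
                break
        return w

    # A linearly separable toy problem (an AND-like function with labels in {-1, +1}).
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([-1, -1, -1, 1])
    print(train_perceptron(X, y))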

The perceptron is a simple classification model that is designed to learn linear decision boundaries in the attribute space. Figure 4.21 shows the decision boundary obtained by applying the perceptron learning algorithm to the data set provided on the left of the figure. However, note that there can be multiple decision boundaries that can separate the two classes, and the perceptron arbitrarily learns one of these boundaries depending on the random initial values of the parameters. (The selection of the optimal decision boundary is a problem that will be revisited in the context of support vector machines in Section 4.9.) Further, the perceptron learning algorithm is only guaranteed to converge when the classes are linearly separable. If the classes are not linearly separable, the algorithm fails to converge. Figure 4.22 shows an example of nonlinearly separable data given by the XOR function. The perceptron cannot find the right solution for this data because there is no linear decision boundary that can perfectly separate the training instances. Thus, the stopping condition at line 12 of Algorithm 4.3 would never be met and hence, the perceptron learning algorithm would fail to converge. This is a major limitation of perceptrons since real-world classification problems often involve nonlinearly separable classes.

Figure 4.21. Perceptron decision boundary for the data given on the left (+ represents a positively labeled instance while o represents a negatively labeled instance).

Figure 4.22. XOR classification problem. No linear hyperplane can separate the two classes.

4.7.2 Multi-layer Neural Network

Amulti-layerneuralnetworkgeneralizesthebasicconceptofaperceptrontomorecomplexarchitecturesofnodesthatarecapableoflearningnonlineardecisionboundaries.Agenericarchitectureofamulti-layerneuralnetworkisshowninFigure4.23 wherethenodesarearrangedingroupscalledlayers.Theselayersarecommonlyorganizedintheformofachainsuchthateverylayeroperatesontheoutputsofitsprecedinglayer.Inthisway,thelayersrepresentdifferentlevelsofabstractionthatareappliedontheinputfeaturesinasequentialmanner.Thecompositionoftheseabstractionsgeneratesthefinaloutputatthelastlayer,whichisusedformakingpredictions.Inthefollowing,webrieflydescribethethreetypesoflayersusedinmulti-layerneuralnetworks.

Figure 4.23. Example of a multi-layer artificial neural network (ANN).

The first layer of the network, called the input layer, is used for representing inputs from attributes. Every numerical or binary attribute is typically represented using a single node on this layer, while a categorical attribute is either represented using a different node for each categorical value, or by encoding the $k$-ary attribute using $\lceil\log_2 k\rceil$ input nodes. These inputs are fed into intermediary layers known as hidden layers, which are made up of processing units known as hidden nodes. Every hidden node operates on signals received from the input nodes or hidden nodes at the preceding layer, and produces an activation value that is transmitted to the next layer. The final layer is called the output layer and processes the activation values from its preceding layer to produce predictions of output variables. For binary classification, the output layer contains a single node representing the binary class label. In this architecture, since the signals are propagated only in the forward direction from the input layer to the output layer, these networks are also called feedforward neural networks.

Amajordifferencebetweenmulti-layerneuralnetworksandperceptronsistheinclusionofhiddenlayers,whichdramaticallyimprovestheirabilitytorepresentarbitrarilycomplexdecisionboundaries.Forexample,considertheXORproblemdescribedintheprevioussection.Theinstancescanbeclassifiedusingtwohyperplanesthatpartitiontheinputspaceintotheirrespectiveclasses,asshowninFigure4.24(a) .Becauseaperceptroncancreateonlyonehyperplane,itcannotfindtheoptimalsolution.However,thisproblemcanbeaddressedbyusingahiddenlayerconsistingoftwonodes,asshowninFigure4.24(b) .Intuitively,wecanthinkofeachhiddennodeasaperceptronthattriestoconstructoneofthetwohyperplanes,whiletheoutputnodesimplycombinestheresultsoftheperceptronstoyieldthedecisionboundaryshowninFigure4.24(a) .

Figure 4.24. A two-layer neural network for the XOR problem.

Thehiddennodescanbeviewedaslearninglatentrepresentationsorfeaturesthatareusefulfordistinguishingbetweentheclasses.Whilethefirsthiddenlayerdirectlyoperatesontheinputattributesandthuscapturessimplerfeatures,thesubsequenthiddenlayersareabletocombinethemand

constructmorecomplexfeatures.Fromthisperspective,multi-layerneuralnetworkslearnahierarchyoffeaturesatdifferentlevelsofabstractionthatarefinallycombinedattheoutputnodestomakepredictions.Further,therearecombinatoriallymanywayswecancombinethefeatureslearnedatthehiddenlayersofANN,makingthemhighlyexpressive.ThispropertychieflydistinguishesANNfromotherclassificationmodelssuchasdecisiontrees,whichcanlearnpartitionsintheattributespacebutareunabletocombinetheminexponentialways.

Figure 4.25. Schematic illustration of the parameters of an ANN model with $(L-1)$ hidden layers.

To understand the nature of computations happening at the hidden and output nodes of ANN, consider the $i$th node at the $l$th layer of the network $(l > 0)$, where the layers are numbered from 0 (input layer) to $L$ (output layer), as shown in Figure 4.25. The activation value generated at this node, $a_i^l$, can be represented as a function of the inputs received from nodes at the preceding layer. Let $w_{ij}^l$ represent the weight of the connection from the $j$th node at layer $(l-1)$ to the $i$th node at layer $l$. Similarly, let us denote the bias term at this node as $b_i^l$. The activation value $a_i^l$ can then be expressed as

$$a_i^l = f(z_i^l) = f\left(\sum_j w_{ij}^l a_j^{l-1} + b_i^l\right),$$

where $z$ is called the linear predictor and $f(\cdot)$ is the activation function that converts $z$ to $a$. Further, note that, by definition, $a_j^0 = x_j$ at the input layer and $a^L = \hat{y}$ at the output node.

There are a number of alternate activation functions apart from the sign function that can be used in multi-layer neural networks. Some examples include linear, sigmoid (logistic), and hyperbolic tangent functions, as shown in Figure 4.26. These functions are able to produce real-valued and nonlinear activation values. Among these activation functions, the sigmoid $\sigma(\cdot)$ has been widely used in many ANN models, although the use of other types of activation functions in the context of deep learning will be discussed in Section 4.8. We can thus represent $a_i^l$ as

$$a_i^l = \sigma(z_i^l) = \frac{1}{1 + e^{-z_i^l}}. \qquad (4.50)$$

Figure 4.26. Types of activation functions used in multi-layer neural networks.

Learning Model Parameters The weights and bias terms $(\mathbf{w}, \mathbf{b})$ of the ANN model are learned during training so that the predictions on training instances match the true labels. This is achieved by using a loss function

$$E(\mathbf{w}, \mathbf{b}) = \sum_{k=1}^{n}\text{Loss}(y_k, \hat{y}_k), \qquad (4.51)$$

where $y_k$ is the true label of the $k$th training instance and $\hat{y}_k$ is equal to $a^L$, produced by using $\mathbf{x}_k$. A typical choice of the loss function is the squared loss function:

$$\text{Loss}(y_k, \hat{y}_k) = (y_k - \hat{y}_k)^2. \qquad (4.52)$$

Note that $E(\mathbf{w}, \mathbf{b})$ is a function of the model parameters $(\mathbf{w}, \mathbf{b})$ because the output activation value $a^L$ depends on the weights and bias terms. We are interested in choosing $(\mathbf{w}, \mathbf{b})$ that minimizes the training loss $E(\mathbf{w}, \mathbf{b})$. Unfortunately, because of the use of hidden nodes with nonlinear activation functions, $E(\mathbf{w}, \mathbf{b})$ is not a convex function of $\mathbf{w}$ and $\mathbf{b}$, which means that $E(\mathbf{w}, \mathbf{b})$ can have local minima that are not globally optimal. However, we can still apply standard optimization techniques such as the gradient descent method to arrive at a locally optimal solution. In particular, the weight parameter $w_{ij}^l$ and the bias term $b_i^l$ can be iteratively updated using the following equations:

$$w_{ij}^l \leftarrow w_{ij}^l - \lambda\frac{\partial E}{\partial w_{ij}^l}, \qquad (4.53)$$

$$b_i^l \leftarrow b_i^l - \lambda\frac{\partial E}{\partial b_i^l}, \qquad (4.54)$$

where $\lambda$ is a hyper-parameter known as the learning rate. The intuition behind this equation is to move the weights in a direction that reduces the training loss. If we arrive at a minimum using this procedure, the gradient of the training loss will be close to 0, eliminating the second term and resulting in the convergence of weights. The weights are commonly initialized with values drawn randomly from a Gaussian or a uniform distribution.

A necessary tool for updating weights in Equation 4.53 is to compute the partial derivative of $E$ with respect to $w_{ij}^l$. This computation is nontrivial especially at hidden layers $(l < L)$, since $w_{ij}^l$ does not directly affect $\hat{y} = a^L$ (and hence the training loss), but has complex chains of influences via activation values at subsequent layers. To address this problem, a technique known as backpropagation was developed, which propagates the derivatives backward from the output layer to the hidden layers. This technique can be described as follows.

Recall that the training loss $E$ is simply the sum of individual losses at training instances. Hence the partial derivative of $E$ can be decomposed as a sum of partial derivatives of individual losses:

$$\frac{\partial E}{\partial w_{ij}^l} = \sum_{k=1}^{n} \frac{\partial\, \mathrm{Loss}(y_k, \hat{y}_k)}{\partial w_{ij}^l}.$$

To simplify discussions, we will consider only the derivatives of the loss at the $k^{th}$ training instance, which will be generically represented as $\mathrm{Loss}(y, a^L)$. By using the chain rule of differentiation, we can represent the partial derivative of the loss with respect to $w_{ij}^l$ as

$$\frac{\partial\, \mathrm{Loss}}{\partial w_{ij}^l} = \frac{\partial\, \mathrm{Loss}}{\partial a_i^l} \times \frac{\partial a_i^l}{\partial z_i^l} \times \frac{\partial z_i^l}{\partial w_{ij}^l}. \quad (4.55)$$

The last term of the previous equation can be written as

$$\frac{\partial z_i^l}{\partial w_{ij}^l} = \frac{\partial \big(\sum_j w_{ij}^l a_j^{l-1} + b_i^l\big)}{\partial w_{ij}^l} = a_j^{l-1}.$$

Also, if we use the sigmoid activation function, then

$$\frac{\partial a_i^l}{\partial z_i^l} = \frac{\partial \sigma(z_i^l)}{\partial z_i^l} = a_i^l (1 - a_i^l).$$

Equation 4.55 can thus be simplified as

$$\frac{\partial\, \mathrm{Loss}}{\partial w_{ij}^l} = \delta_i^l \times a_i^l (1 - a_i^l) \times a_j^{l-1}, \quad \text{where } \delta_i^l = \frac{\partial\, \mathrm{Loss}}{\partial a_i^l}. \quad (4.56)$$

A similar formula for the partial derivative with respect to the bias term $b_i^l$ is given by

$$\frac{\partial\, \mathrm{Loss}}{\partial b_i^l} = \delta_i^l \times a_i^l (1 - a_i^l). \quad (4.57)$$

Hence, to compute the partial derivatives, we only need to determine $\delta_i^l$. Using a squared loss function, we can easily write $\delta^L$ at the output node as

$$\delta^L = \frac{\partial\, \mathrm{Loss}}{\partial a^L} = \frac{\partial (y - a^L)^2}{\partial a^L} = 2(a^L - y). \quad (4.58)$$

However, the approach for computing $\delta_j^l$ at hidden nodes $(l < L)$ is more involved. Notice that $a_j^l$ affects the activation values $a_i^{l+1}$ of all nodes at the next layer, which in turn influence the loss. Hence, again using the chain rule of differentiation, $\delta_j^l$ can be represented as

$$\delta_j^l = \frac{\partial\, \mathrm{Loss}}{\partial a_j^l} = \sum_i \Big( \frac{\partial\, \mathrm{Loss}}{\partial a_i^{l+1}} \times \frac{\partial a_i^{l+1}}{\partial a_j^l} \Big) = \sum_i \Big( \delta_i^{l+1} \times a_i^{l+1}(1 - a_i^{l+1}) \times w_{ij}^{l+1} \Big). \quad (4.59)$$

The previous equation provides a concise representation of the $\delta$ values at layer $l$ in terms of the $\delta$ values computed at layer $l+1$. Hence, proceeding backward from the output layer $L$ to the hidden layers, we can recursively apply Equation 4.59 to compute $\delta_i^l$ at every hidden node. $\delta_i^l$ can then be used in Equations 4.56 and 4.57 to compute the partial derivatives of the loss with respect to $w_{ij}^l$ and $b_i^l$, respectively. Algorithm 4.4 summarizes the complete approach for learning the model parameters of ANN using backpropagation and the gradient descent method.

Algorithm 4.4 Learning ANN using backpropagation and gradient descent.
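As an illustration of the computations in Equations 4.50-4.59, the following is a minimal NumPy sketch of training by backpropagation and gradient descent. It is not the book's pseudocode: the network size, learning rate, number of passes, and the tiny XOR-style training set are assumptions made purely for illustration, and updates are made per training instance (a stochastic variant of Equations 4.53-4.54).

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 2 inputs -> 3 hidden nodes -> 1 output, sigmoid everywhere.
sizes = [2, 3, 1]
W = [rng.normal(scale=0.5, size=(sizes[l + 1], sizes[l])) for l in range(2)]
b = [np.zeros(sizes[l + 1]) for l in range(2)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical training set (XOR), used only for illustration.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

lam = 0.5  # learning rate (lambda in Eqs. 4.53-4.54)
for epoch in range(5000):
    for xk, yk in zip(X, y):
        # Forward pass: Eq. 4.50, with a^0 = x.
        a = [xk]
        for l in range(2):
            a.append(sigmoid(W[l] @ a[l] + b[l]))
        # Backward pass: delta at the output node (Eq. 4.58), then recurse (Eq. 4.59).
        delta = 2.0 * (a[-1] - yk)                       # dLoss/da^L for squared loss
        for l in reversed(range(2)):
            grad_z = delta * a[l + 1] * (1 - a[l + 1])   # combines Eqs. 4.56-4.57
            dW = np.outer(grad_z, a[l])
            db = grad_z
            delta = W[l].T @ grad_z                      # propagate delta to layer l-1
            W[l] -= lam * dW                             # gradient descent, Eq. 4.53
            b[l] -= lam * db                             # gradient descent, Eq. 4.54

# Predictions after training (ideally close to the XOR labels 0, 1, 1, 0).
print(np.round([sigmoid(W[1] @ sigmoid(W[0] @ xk + b[0]) + b[1])[0] for xk in X], 2))
```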


4.7.3CharacteristicsofANN

1. Multi-layerneuralnetworkswithatleastonehiddenlayerareuniversalapproximators;i.e.,theycanbeusedtoapproximateanytargetfunction.Theyarethushighlyexpressiveandcanbeusedtolearncomplexdecisionboundariesindiverseapplications.ANNcanalsobeusedformulticlassclassificationandregressionproblems,by

appropriatelymodifyingtheoutputlayer.However,thehighmodelcomplexityofclassicalANNmodelsmakesitsusceptibletooverfitting,whichcanbeovercometosomeextentbyusingdeeplearningtechniquesdiscussedinSection4.8.3 .

2. ANNprovidesanaturalwaytorepresentahierarchyoffeaturesatmultiplelevelsofabstraction.TheoutputsatthefinalhiddenlayeroftheANNmodelthusrepresentfeaturesatthehighestlevelofabstractionthataremostusefulforclassification.Thesefeaturescanalsobeusedasinputsinothersupervisedclassificationmodels,e.g.,byreplacingtheoutputnodeoftheANNbyanygenericclassifier.

3. ANNrepresentscomplexhigh-levelfeaturesascompositionsofsimplerlower-levelfeaturesthatareeasiertolearn.ThisprovidesANNtheabilitytograduallyincreasethecomplexityofrepresentations,byaddingmorehiddenlayerstothearchitecture.Further,sincesimplerfeaturescanbecombinedincombinatorialways,thenumberofcomplexfeatureslearnedbyANNismuchlargerthantraditionalclassificationmodels.Thisisoneofthemainreasonsbehindthehighexpressivepowerofdeepneuralnetworks.

4. ANNcaneasilyhandleirrelevantattributes,byusingzeroweightsforattributesthatdonothelpinimprovingthetrainingloss.Also,redundantattributesreceivesimilarweightsanddonotdegradethequalityoftheclassifier.However,ifthenumberofirrelevantorredundantattributesislarge,thelearningoftheANNmodelmaysufferfromoverfitting,leadingtopoorgeneralizationperformance.

5. SincethelearningofANNmodelinvolvesminimizinganon-convexfunction,thesolutionsobtainedbygradientdescentarenotguaranteedtobegloballyoptimal.Forthisreason,ANNhasatendencytogetstuckinlocalminima,achallengethatcanbeaddressedbyusingdeeplearningtechniquesdiscussedinSection4.8.4 .

6. Training an ANN is a time-consuming process, especially when the number of hidden nodes is large. Nevertheless, test examples can be classified rapidly.

7. Just like logistic regression, ANN can learn in the presence of interacting variables, since the model parameters are jointly learned over all variables together. However, ANN cannot handle instances with missing values in the training or testing phase.

4.8DeepLearningAsdescribedabove,theuseofhiddenlayersinANNisbasedonthegeneralbeliefthatcomplexhigh-levelfeaturescanbeconstructedbycombiningsimplerlower-levelfeatures.Typically,thegreaterthenumberofhiddenlayers,thedeeperthehierarchyoffeatureslearnedbythenetwork.ThismotivatesthelearningofANNmodelswithlongchainsofhiddenlayers,knownasdeepneuralnetworks.Incontrastto“shallow”neuralnetworksthatinvolveonlyasmallnumberofhiddenlayers,deepneuralnetworksareabletorepresentfeaturesatmultiplelevelsofabstractionandoftenrequirefarfewernodesperlayertoachievegeneralizationperformancesimilartoshallownetworks.

Despitethehugepotentialinlearningdeepneuralnetworks,ithasremainedchallengingtolearnANNmodelswithalargenumberofhiddenlayersusingclassicalapproaches.Apartfromreasonsrelatedtolimitedcomputationalresourcesandhardwarearchitectures,therehavebeenanumberofalgorithmicchallengesinlearningdeepneuralnetworks.First,learningadeepneuralnetworkwithlowtrainingerrorhasbeenadauntingtaskbecauseofthesaturationofsigmoidactivationfunctions,resultinginslowconvergenceofgradientdescent.Thisproblembecomesevenmoreseriousaswemoveawayfromtheoutputnodetothehiddenlayers,becauseofthecompoundedeffectsofsaturationatmultiplelayers,knownasthevanishinggradientproblem.Becauseofthisreason,classicalANNmodelshavesufferedfromslowandineffectivelearning,leadingtopoortrainingandtestperformance.Second,thelearningofdeepneuralnetworksisquitesensitivetotheinitialvaluesofmodelparameters,chieflybecauseofthenon-convexnatureoftheoptimizationfunctionandtheslowconvergenceofgradientdescent.Third,deepneuralnetworkswithalargenumberofhiddenlayershavehighmodel

complexity,makingthemsusceptibletooverfitting.Hence,evenifadeepneuralnetworkhasbeentrainedtoshowlowtrainingerror,itcanstillsufferfrompoorgeneralizationperformance.

Thesechallengeshavedeterredprogressinbuildingdeepneuralnetworksforseveraldecadesanditisonlyrecentlythatwehavestartedtounlocktheirimmensepotentialwiththehelpofanumberofadvancesbeingmadeintheareaofdeeplearning.Althoughsomeoftheseadvanceshavebeenaroundforsometime,theyhaveonlygainedmainstreamattentioninthelastdecade,withdeepneuralnetworkscontinuallybeatingrecordsinvariouscompetitionsandsolvingproblemsthatweretoodifficultforotherclassificationapproaches.

Therearetwofactorsthathaveplayedamajorroleintheemergenceofdeeplearningtechniques.First,theavailabilityoflargerlabeleddatasets,e.g.,theImageNetdatasetcontainsmorethan10millionlabeledimages,hasmadeitpossibletolearnmorecomplexANNmodelsthaneverbefore,withoutfallingeasilyintothetrapsofmodeloverfitting.Second,advancesincomputationalabilitiesandhardwareinfrastructures,suchastheuseofgraphicalprocessingunits(GPU)fordistributedcomputing,havegreatlyhelpedinexperimentingwithdeepneuralnetworkswithlargerarchitecturesthatwouldnothavebeenfeasiblewithtraditionalresources.

Inadditiontotheprevioustwofactors,therehavebeenanumberofalgorithmicadvancementstoovercomethechallengesfacedbyclassicalmethodsinlearningdeepneuralnetworks.Someexamplesincludetheuseofmoreresponsivecombinationsoflossfunctionsandactivationfunctions,betterinitializationofmodelparameters,novelregularizationtechniques,moreagilearchitecturedesigns,andbettertechniquesformodellearningandhyper-parameterselection.Inthefollowing,wedescribesomeofthedeeplearningadvancesmadetoaddressthechallengesinlearningdeepneural

networks.FurtherdetailsonrecentdevelopmentsindeeplearningcanbeobtainedfromtheBibliographicNotes.

4.8.1UsingSynergisticLossFunctions

Oneofthemajorrealizationsleadingtodeeplearninghasbeentheimportanceofchoosingappropriatecombinationsofactivationandlossfunctions.ClassicalANNmodelscommonlymadeuseofthesigmoidactivationfunctionattheoutputlayer,becauseofitsabilitytoproducereal-valuedoutputsbetween0and1,whichwascombinedwithasquaredlossobjectivetoperformgradientdescent.Itwassoonnoticedthatthisparticularcombinationofactivationandlossfunctionresultedinthesaturationofoutputactivationvalues,whichcanbedescribedasfollows.

Saturation of Outputs
Although the sigmoid has been widely used as an activation function, it easily saturates at high and low values of inputs that are far away from 0. Observe from Figure 4.27(a) that $\sigma(z)$ shows variance in its values only when $z$ is close to 0. For this reason, $\partial\sigma(z)/\partial z$ is non-zero for only a small range of $z$ around 0, as shown in Figure 4.27(b). Since $\partial\sigma(z)/\partial z$ is one of the components in the gradient of the loss (see Equation 4.55), we get a diminishing gradient value when the activation values are far from 0.

Figure 4.27. Plots of the sigmoid function and its derivative.

To illustrate the effect of saturation on the learning of model parameters at the output node, consider the partial derivative of the loss with respect to the weight $w_j^L$ at the output node. Using the squared loss function, we can write this as

$$\frac{\partial\, \mathrm{Loss}}{\partial w_j^L} = 2(a^L - y) \times \sigma(z^L)\big(1 - \sigma(z^L)\big) \times a_j^{L-1}. \quad (4.60)$$

In the previous equation, notice that when $z^L$ is highly negative, $\sigma(z^L)$ (and hence the gradient) is close to 0. On the other hand, when $z^L$ is highly positive, $(1 - \sigma(z^L))$ becomes close to 0, nullifying the value of the gradient. Hence, irrespective of whether the prediction $a^L$ matches the true label $y$ or not, the gradient of the loss with respect to the weights is close to 0 whenever $z^L$ is highly positive or negative. This causes an unnecessarily slow convergence of the model parameters of the ANN model, often resulting in poor learning.

Note that it is the combination of the squared loss function and the sigmoid activation function at the output node that together results in diminishing gradients (and thus poor learning) upon saturation of outputs. It is thus important to choose a synergistic combination of loss function and activation function that does not suffer from the saturation of outputs.

Cross entropy loss function
The cross entropy loss function, which was described in the context of logistic regression in Section 4.6.2, can significantly avoid the problem of saturating outputs when used in combination with the sigmoid activation function. The cross entropy loss function of a real-valued prediction $\hat{y} \in (0, 1)$ on a data instance with binary label $y \in \{0, 1\}$ can be defined as

$$\mathrm{Loss}(y, \hat{y}) = -y \log(\hat{y}) - (1 - y)\log(1 - \hat{y}), \quad (4.61)$$

where $\log$ represents the natural logarithm (to base $e$) and $0\log(0) = 0$ for convenience. The cross entropy function has foundations in information theory and measures the amount of disagreement between $y$ and $\hat{y}$. The partial derivative of this loss function with respect to $\hat{y} = a^L$ can be given as

$$\delta^L = \frac{\partial\, \mathrm{Loss}}{\partial a^L} = \frac{-y}{a^L} + \frac{1 - y}{1 - a^L} = \frac{a^L - y}{a^L(1 - a^L)}. \quad (4.62)$$

Using this value of $\delta^L$ in Equation 4.56, we can obtain the partial derivative of the loss with respect to the weight $w_j^L$ at the output node as

$$\frac{\partial\, \mathrm{Loss}}{\partial w_j^L} = \frac{a^L - y}{a^L(1 - a^L)} \times a^L(1 - a^L) \times a_j^{L-1} = (a^L - y) \times a_j^{L-1}. \quad (4.63)$$

Notice the simplicity of the previous formula using the cross entropy loss function. The partial derivatives of the loss with respect to the weights at the output node depend only on the difference between the prediction $a^L$ and the true label $y$. In contrast to Equation 4.60, it does not involve terms such as $\sigma(z^L)(1 - \sigma(z^L))$ that can be impacted by saturation of $z^L$. Hence, the gradients are high whenever $(a^L - y)$ is large, promoting effective learning of the model parameters at the output node. This has been a major breakthrough in the learning of modern ANN models and it is now a common practice to use the cross entropy loss function with sigmoid activations at the output node.
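A small numerical sketch (the values below are chosen only for illustration) shows the difference between Equations 4.60 and 4.63 when the output saturates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A saturated output node: z^L is highly negative, so a^L is nearly 0,
# but the true label is y = 1 (the prediction is badly wrong).
zL, y, a_prev = -8.0, 1.0, 1.0      # a_prev plays the role of a_j^{L-1}
aL = sigmoid(zL)

# Squared loss gradient w.r.t. w_j^L (Eq. 4.60): vanishes despite the large error.
grad_squared = 2 * (aL - y) * aL * (1 - aL) * a_prev

# Cross entropy gradient w.r.t. w_j^L (Eq. 4.63): stays large.
grad_xent = (aL - y) * a_prev

print(f"squared loss gradient:  {grad_squared:.6f}")   # roughly -0.0007
print(f"cross entropy gradient: {grad_xent:.6f}")      # roughly -0.9997
```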

4.8.2 Using Responsive Activation Functions

Even though the cross entropy loss function helps in overcoming the problem of saturating outputs, it still does not solve the problem of saturation at hidden layers, arising due to the use of sigmoid activation functions at hidden nodes. In fact, the effect of saturation on the learning of model parameters is even more aggravated at hidden layers, a problem known as the vanishing gradient problem. In the following, we describe the vanishing gradient problem and the use of a more responsive activation function, called the rectified linear unit (ReLU), to overcome this problem.

Vanishing Gradient Problem
The impact of saturating activation values on the learning of model parameters increases at deeper hidden layers that are farther away from the output node. Even if the activation in the output layer does not saturate, the repeated multiplications performed as we backpropagate the gradients from the output layer to the hidden layers may lead to decreasing gradients in the hidden layers. This is called the vanishing gradient problem, which has been one of the major hindrances in learning deep neural networks.

To illustrate the vanishing gradient problem, consider an ANN model that consists of a single node at every hidden layer of the network, as shown in Figure 4.28. This simplified architecture involves a single chain of hidden nodes where a single weighted link $w^l$ connects the node at layer $l-1$ to the node at layer $l$. Using Equations 4.56 and 4.59, we can represent the partial derivative of the loss with respect to $w^l$ as

$$\frac{\partial\, \mathrm{Loss}}{\partial w^l} = \delta^l \times a^l(1 - a^l) \times a^{l-1}, \quad \text{where } \delta^l = 2(a^L - y) \times \prod_{r=l}^{L-1} \Big( a^{r+1}(1 - a^{r+1}) \times w^{r+1} \Big). \quad (4.64)$$

Notice that if any of the linear predictors $z^{r+1}$ saturates at subsequent layers, then the term $a^{r+1}(1 - a^{r+1})$ becomes close to 0, thus diminishing the overall gradient. The saturation of activations thus gets compounded and has multiplicative effects on the gradients at hidden layers, making them highly unstable and thus unsuitable for use with gradient descent. Even though the previous discussion only pertains to the simplified architecture involving a single chain of hidden nodes, a similar argument can be made for any generic ANN architecture involving multiple chains of hidden nodes. Note that the vanishing gradient problem primarily arises because of the use of the sigmoid activation function at hidden nodes, which is known to easily saturate, especially after repeated multiplications.

Figure 4.28. An example of an ANN model with only one node at every hidden layer.

Figure 4.29. Plot of the rectified linear unit (ReLU) activation function.

Rectified Linear Units (ReLU)
To overcome the vanishing gradient problem, it is important to use an activation function $f(z)$ at the hidden nodes that provides a stable and significant value of the gradient whenever a hidden node is active, i.e., $z > 0$. This is achieved by using rectified linear units (ReLU) as activation functions at hidden nodes, which can be defined as

$$a = f(z) = \begin{cases} z, & \text{if } z > 0, \\ 0, & \text{otherwise.} \end{cases} \quad (4.65)$$

The idea of ReLU has been inspired by biological neurons, which are either in an inactive state $(f(z) = 0)$ or show an activation value proportional to the input. Figure 4.29 shows a plot of the ReLU function. We can see that it is linear with respect to $z$ when $z > 0$. Hence, the gradient of the activation value with respect to $z$ can be written as

$$\frac{\partial a}{\partial z} = \begin{cases} 1, & \text{if } z > 0, \\ 0, & \text{if } z < 0. \end{cases} \quad (4.66)$$

Although $f(z)$ is not differentiable at 0, it is common practice to use $\partial a/\partial z = 0$ when $z = 0$. Since the gradient of the ReLU activation function is equal to 1 whenever $z > 0$, it avoids the problem of saturation at hidden nodes, even after repeated multiplications. Using ReLU, the partial derivatives of the loss with respect to the weight and bias parameters can be given by

$$\frac{\partial\, \mathrm{Loss}}{\partial w_{ij}^l} = \delta_i^l \times I(z_i^l) \times a_j^{l-1}, \quad (4.67)$$

$$\frac{\partial\, \mathrm{Loss}}{\partial b_i^l} = \delta_i^l \times I(z_i^l), \quad \text{where } \delta_j^l = \sum_i \Big( \delta_i^{l+1} \times I(z_i^{l+1}) \times w_{ij}^{l+1} \Big), \text{ and } I(z) = \begin{cases} 1, & \text{if } z > 0, \\ 0, & \text{otherwise.} \end{cases} \quad (4.68)$$

Notice that ReLU shows a linear behavior in the activation values whenever a node is active, as compared to the nonlinear properties of the sigmoid function. This linearity promotes better flow of gradients during backpropagation, and thus simplifies the learning of ANN model parameters. ReLU is also highly responsive at large values of $z$ away from 0, as opposed to the sigmoid activation function, making it more suitable for gradient descent. These differences give ReLU a major advantage over the sigmoid function. Indeed, ReLU is the preferred choice of activation function at hidden layers in most modern ANN models.
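The compounding effect in Equation 4.64 can be illustrated with a short sketch. The pre-activation values below are made up, and the comparison only shows the per-layer derivative factors contributed by sigmoid versus ReLU hidden nodes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear predictors at 10 consecutive hidden nodes of a chain
# network (Figure 4.28 style); the values are made up for illustration.
z = np.array([2.1, -1.7, 3.0, 0.8, -2.4, 1.5, 2.8, -0.6, 1.9, 2.2])

# Per-layer factor a(1 - a) that multiplies into delta^l for sigmoid (Eq. 4.64):
# never larger than 0.25, and tiny whenever |z| is large.
a = sigmoid(z)
sig_factor = a * (1 - a)

# Per-layer factor I(z) for ReLU hidden nodes (Eqs. 4.66-4.68): exactly 1 when active.
relu_factor = (z > 0).astype(float)

print("sigmoid factors:", np.round(sig_factor, 3))
print("product over 10 layers:", np.prod(sig_factor))   # shrinks toward 0
print("ReLU factors:   ", relu_factor)                  # stay at 1 on active nodes
```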

4.8.3 Regularization

A major challenge in learning deep neural networks is the high model complexity of ANN models, which grows with the addition of hidden layers in the network. This can become a serious concern, especially when the training set is small, due to the phenomenon of model overfitting. To overcome this

challenge,itisimportanttousetechniquesthatcanhelpinreducingthecomplexityofthelearnedmodel,knownasregularizationtechniques.ClassicalapproachesforlearningANNmodelsdidnothaveaneffectivewaytopromoteregularizationofthelearnedmodelparameters.Hence,theyhadoftenbeensidelinedbyotherclassificationmethods,suchassupportvectormachines(SVM),whichhavein-builtregularizationmechanisms.(SVMswillbediscussedinmoredetailinSection4.9 ).

OneofthemajoradvancementsindeeplearninghasbeenthedevelopmentofnovelregularizationtechniquesforANNmodelsthatareabletooffersignificantimprovementsingeneralizationperformance.Inthefollowing,wediscussoneoftheregularizationtechniquesforANN,knownasthedropoutmethod,thathavegainedalotofattentioninseveralapplications.

DropoutThemainobjectiveofdropoutistoavoidthelearningofspuriousfeaturesathiddennodes,occurringduetomodeloverfitting.Itusesthebasicintuitionthatspuriousfeaturesoften“co-adapt”themselvessuchthattheyshowgoodtrainingperformanceonlywhenusedinhighlyselectivecombinations.Ontheotherhand,relevantfeaturescanbeusedinadiversityoffeaturecombinationsandhencearequiteresilienttotheremovalormodificationofotherfeatures.Thedropoutmethodusesthisintuitiontobreakcomplex“co-adaptations”inthelearnedfeaturesbyrandomlydroppinginputandhiddennodesinthenetworkduringtraining.

Dropoutbelongstoafamilyofregularizationtechniquesthatusesthecriteriaofresiliencetorandomperturbationsasameasureoftherobustness(andhence,simplicity)ofamodel.Forexample,oneapproachtoregularizationistoinjectnoiseintheinputattributesofthetrainingsetandlearnamodelwiththenoisytraininginstances.Ifafeaturelearnedfromthetrainingdatais

indeedgeneralizable,itshouldnotbeaffectedbytheadditionofnoise.Dropoutcanbeviewedasasimilarregularizationapproachthatperturbstheinformationcontentofthetrainingsetnotonlyatthelevelofattributesbutalsoatmultiplelevelsofabstractions,bydroppinginputandhiddennodes.

Thedropoutmethoddrawsinspirationfromthebiologicalprocessofgeneswappinginsexualreproduction,wherehalfofthegenesfrombothparentsarecombinedtogethertocreatethegenesoftheoffspring.Thisfavorstheselectionofparentgenesthatarenotonlyusefulbutcanalsointer-minglewithdiversecombinationsofgenescomingfromtheotherparent.Ontheotherhand,co-adaptedgenesthatfunctiononlyinhighlyselectivecombinationsaresooneliminatedintheprocessofevolution.Thisideaisusedinthedropoutmethodforeliminatingspuriousco-adaptedfeatures.Asimplifieddescriptionofthedropoutmethodisprovidedintherestofthissection.

Figure 4.30. Examples of sub-networks generated in the dropout method using $\gamma = 0.5$.

Let $(\mathbf{w}^k, \mathbf{b}^k)$ represent the model parameters of the ANN model at the $k^{th}$ iteration of the gradient descent method. At every iteration, we randomly select a fraction $\gamma$ of input and hidden nodes to be dropped from the network, where $\gamma \in (0, 1)$ is a hyper-parameter that is typically chosen to be 0.5. The weighted links and bias terms involving the dropped nodes are then eliminated, resulting in a "thinned" sub-network of smaller size. The model parameters of the sub-network, $(\mathbf{w}_s^k, \mathbf{b}_s^k)$, are then updated by computing activation values and performing backpropagation on this smaller sub-network. These updated values are then added back in the original network to obtain the updated model parameters, $(\mathbf{w}^{k+1}, \mathbf{b}^{k+1})$, to be used in the next iteration.

Figure 4.30 shows some examples of sub-networks that can be generated at different iterations of the dropout method, by randomly dropping input and hidden nodes. Since every sub-network has a different architecture, it is difficult to learn complex co-adaptations in the features that can result in overfitting. Instead, the features at the hidden nodes are learned to be more agile to random modifications in the network structure, thus improving their generalization ability. The model parameters are updated using a different random sub-network at every iteration, till the gradient descent method converges.

Let $(\mathbf{w}^{k_{max}}, \mathbf{b}^{k_{max}})$ denote the model parameters at the last iteration $k_{max}$ of the gradient descent method. These parameters are finally scaled down by a factor of $(1 - \gamma)$, to produce the weights and bias terms of the final ANN model, as follows:

$$(\mathbf{w}^*, \mathbf{b}^*) = \big( (1 - \gamma) \times \mathbf{w}^{k_{max}},\; (1 - \gamma) \times \mathbf{b}^{k_{max}} \big).$$

We can now use the complete neural network with model parameters $(\mathbf{w}^*, \mathbf{b}^*)$ for testing. The dropout method has been shown to provide significant improvements in the generalization performance of ANN models in a number of applications. It is computationally cheap and can be applied in combination with any of the other deep learning techniques. It also has a number of similarities with a widely-used ensemble learning method known as bagging, which learns multiple models using random subsets of the training set, and then uses the average output of all the models to make predictions. (Bagging will be presented in more detail later in Section 4.10.4.) In a similar vein, it can be shown that the predictions of the final network learned using dropout approximate the average output of all possible $2^n$ sub-networks that can be formed using $n$ nodes. This is one of the reasons behind the superior regularization abilities of dropout.
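A minimal sketch of the dropout idea is shown below. The layer activations are made-up numbers, and for simplicity the test-time scaling by $(1 - \gamma)$ is applied to the activations rather than to the weights, which has the same effect on the next layer's linear predictor:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical activations of one hidden layer for a mini-batch (values made up).
hidden = rng.uniform(size=(4, 6))    # 4 instances, 6 hidden nodes
gamma = 0.5                          # fraction of nodes to drop, as in the text

# Training: randomly drop a fraction gamma of the hidden nodes for this iteration,
# producing the "thinned" sub-network whose parameters get updated.
keep_mask = (rng.uniform(size=hidden.shape[1]) >= gamma).astype(float)
hidden_train = hidden * keep_mask            # dropped nodes contribute nothing

# Testing: use the full network, scaled so the expected input to the next
# layer matches what was seen during training.
hidden_test = hidden * (1.0 - gamma)

print("kept nodes:", keep_mask)
print("training activations:\n", np.round(hidden_train, 2))
print("test-time activations:\n", np.round(hidden_test, 2))
```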

4.8.4InitializationofModelParameters

Becauseofthenon-convexnatureofthelossfunctionusedbyANNmodels,itispossibletogetstuckinlocallyoptimalbutgloballyinferiorsolutions.Hence,theinitialchoiceofmodelparametervaluesplaysasignificantroleinthelearningofANNbygradientdescent.Theimpactofpoorinitializationisevenmoreaggravatedwhenthemodeliscomplex,thenetworkarchitectureisdeep,ortheclassificationtaskisdifficult.Insuchcases,itisoftenadvisabletofirstlearnasimplermodelfortheproblem,e.g.,usingasinglehiddenlayer,andthenincrementallyincreasethecomplexityofthemodel,e.g.,byaddingmorehiddenlayers.Analternateapproachistotrainthemodelforasimplertaskandthenusethelearnedmodelparametersasinitialparameterchoicesinthelearningoftheoriginaltask.TheprocessofinitializingANNmodelparametersbeforetheactualtrainingprocessisknownaspretraining.

Pretraininghelpsininitializingthemodeltoasuitableregionintheparameterspacethatwouldotherwisebeinaccessiblebyrandominitialization.Pretrainingalsoreducesthevarianceinthemodelparametersbyfixingthestartingpointofgradientdescent,thusreducingthechancesofoverfittingduetomultiplecomparisons.Themodelslearnedbypretrainingarethusmoreconsistentandprovidebettergeneralizationperformance.

SupervisedPretrainingAcommonapproachforpretrainingistoincrementallytraintheANNmodelinalayer-wisemanner,byaddingonehiddenlayeratatime.Thisapproach,

knownassupervisedpretraining,ensuresthattheparameterslearnedateverylayerareobtainedbysolvingasimplerproblem,ratherthanlearningallmodelparameterstogether.TheseparametervaluesthusprovideagoodchoiceforinitializingtheANNmodel.Theapproachforsupervisedpretrainingcanbebrieflydescribedasfollows.

WestartthesupervisedpretrainingprocessbyconsideringareducedANNmodelwithonlyasinglehiddenlayer.Byapplyinggradientdescentonthissimplemodel,weareabletolearnthemodelparametersofthefirsthiddenlayer.Atthenextrun,weaddanotherhiddenlayertothemodelandapplygradientdescenttolearntheparametersofthenewlyaddedhiddenlayer,whilekeepingtheparametersofthefirstlayerfixed.Thisprocedureisrecursivelyappliedsuchthatwhilelearningtheparametersofthel hiddenlayer,weconsiderareducedmodelwithonlylhiddenlayers,whosefirsthiddenlayersarenotupdatedonthel runbutareinsteadfixedusingpretrainedvaluesfrompreviousruns.Inthisway,weareabletolearnthemodelparametersofall hiddenlayers.ThesepretrainedvaluesareusedtoinitializethehiddenlayersofthefinalANNmodel,whichisfine-tunedbyapplyingafinalroundofgradientdescentoverallthelayers.

UnsupervisedPretrainingSupervisedpretrainingprovidesapowerfulwaytoinitializemodelparameters,bygraduallygrowingthemodelcomplexityfromshallowertodeepernetworks.However,supervisedpretrainingrequiresasufficientnumberoflabeledtraininginstancesforeffectiveinitializationoftheANNmodel.Analternatepretrainingapproachisunsupervisedpretraining,whichinitializesmodelparametersbyusingunlabeledinstancesthatareoftenabundantlyavailable.ThebasicideaofunsupervisedpretrainingistoinitializetheANN

th

(l−1)th

(L−1)

modelinsuchawaythatthelearnedfeaturescapturethelatentstructureintheunlabeleddata.

Figure4.31.Thebasicarchitectureofasingle-layerautoencoder.

Unsupervisedpretrainingreliesontheassumptionthatlearningthedistributionoftheinputdatacanindirectlyhelpinlearningtheclassificationmodel.Itismosthelpfulwhenthenumberoflabeledexamplesissmallandthefeaturesforthesupervisedproblembearresemblancetothefactorsgeneratingtheinputdata.Unsupervisedpretrainingcanbeviewedasadifferentformofregularization,wherethefocusisnotexplicitlytowardfindingsimplerfeaturesbutinsteadtowardfindingfeaturesthatcanbestexplaintheinputdata.Historically,unsupervisedpretraininghasplayedanimportantroleinrevivingtheareaofdeeplearning,bymakingitpossibletotrainanygenericdeepneuralnetworkwithoutrequiringspecializedarchitectures.

Use of Autoencoders

One simple and commonly used approach for unsupervised pretraining is to use an unsupervised ANN model known as an autoencoder. The basic architecture of an autoencoder is shown in Figure 4.31. An autoencoder attempts to learn a reconstruction of the input data by mapping the attributes $\mathbf{x}$ to latent features, and then re-projecting the latent features back to the original attribute space to create the reconstruction $\hat{\mathbf{x}}$. The latent features are represented using a hidden layer of nodes, while the input and output layers represent the attributes and contain the same number of nodes. During training, the goal is to learn an autoencoder model that provides the lowest reconstruction error, $RE(\mathbf{x}, \hat{\mathbf{x}})$, on all input data instances. A typical choice of the reconstruction error is the squared loss function:

$$RE(\mathbf{x}, \hat{\mathbf{x}}) = \| \mathbf{x} - \hat{\mathbf{x}} \|^2.$$

The model parameters of the autoencoder can be learned by using a similar gradient descent method as the one used for learning supervised ANN models for classification. The key difference is the use of the reconstruction error on all training instances as the training loss. Autoencoders that have multiple layers of hidden nodes are known as stacked autoencoders.

Autoencoders are able to capture complex representations of the input data by the use of hidden nodes. However, if the number of hidden nodes is large, it is possible for an autoencoder to learn the identity relationship, where the input $\mathbf{x}$ is just copied and returned as the output $\hat{\mathbf{x}}$, resulting in a trivial solution. For example, if we use as many hidden nodes as the number of attributes, then it is possible for every hidden node to copy an attribute and simply pass it along to an output node, without extracting any useful information. To avoid this problem, it is common practice to keep the number of hidden nodes smaller than the number of input attributes. This forces the autoencoder to learn a compact and useful encoding of the input data, similar to a dimensionality reduction technique. An alternate approach is to corrupt the input instances by adding random noise, and then learn the autoencoder to reconstruct the original instance from the noisy input. This approach is known as the denoising autoencoder, which offers strong regularization capabilities and is often used to learn complex features even in the presence of a large number of hidden nodes.
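A minimal sketch of a single-layer autoencoder trained to minimize the squared reconstruction error is given below. The data, network sizes, learning rate, and number of passes are assumptions chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical unlabeled data: 100 instances with 4 attributes (values made up).
X = rng.normal(size=(100, 4))

# Autoencoder with fewer hidden nodes than attributes, so it cannot simply copy the input.
W_enc = rng.normal(scale=0.1, size=(2, 4)); b_enc = np.zeros(2)
W_dec = rng.normal(scale=0.1, size=(4, 2)); b_dec = np.zeros(4)

lam = 0.01
for epoch in range(200):
    for x in X:
        h = sigmoid(W_enc @ x + b_enc)        # latent features
        x_hat = W_dec @ h + b_dec             # reconstruction (linear output layer)
        err = x_hat - x                       # gradient of RE(x, x_hat), up to a factor of 2
        # Backpropagate the reconstruction error, analogous to supervised training.
        grad_dec = np.outer(err, h)
        delta_h = (W_dec.T @ err) * h * (1 - h)
        grad_enc = np.outer(delta_h, x)
        W_dec -= lam * grad_dec; b_dec -= lam * err
        W_enc -= lam * grad_enc; b_enc -= lam * delta_h

recon = sigmoid(X @ W_enc.T + b_enc) @ W_dec.T + b_dec
print("mean reconstruction error:", np.mean(np.sum((X - recon) ** 2, axis=1)))
```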

Touseanautoencoderforunsupervisedpretraining,wecanfollowasimilarlayer-wiseapproachlikesupervisedpretraining.Inparticular,topretrainthemodelparametersofthel hiddenlayer,wecanconstructareducedANNmodelwithonlylhiddenlayersandanoutputlayercontainingthesamenumberofnodesastheattributesandisusedforreconstruction.Theparametersofthel hiddenlayerofthisnetworkarethenlearnedusingagradientdescentmethodtominimizethereconstructionerror.Theuseofunlabeleddatacanbeviewedasprovidinghintstothelearningofparametersateverylayerthataidingeneralization.ThefinalmodelparametersoftheANNmodelarethenlearnedbyapplyinggradientdescentoverallthelayers,usingtheinitialvaluesofparametersobtainedfrompretraining.

HybridPretrainingUnsupervisedpretrainingcanalsobecombinedwithsupervisedpretrainingbyusingtwooutputlayersateveryrunofpretraining,oneforreconstructionandtheotherforsupervisedclassification.Theparametersofthel hiddenlayerarethenlearnedbyjointlyminimizingthelossesonbothoutputlayers,usuallyweightedbyatrade-offhyper-parameter .Suchacombinedapproachoftenshowsbettergeneralizationperformancethaneitheroftheapproaches,sinceitprovidesawaytobalancebetweenthecompetingobjectivesofrepresentingtheinputdataandimprovingclassificationperformance.

th

th

th

α

4.8.5CharacteristicsofDeepLearning

ApartfromthebasiccharacteristicsofANNdiscussedinSection4.7.3 ,theuseofdeeplearningtechniquesprovidesthefollowingadditionalcharacteristics:

1. AnANNmodeltrainedforsometaskcanbeeasilyre-usedforadifferenttaskthatinvolvesthesameattributes,byusingpretrainingstrategies.Forexample,wecanusethelearnedparametersoftheoriginaltaskasinitialparameterchoicesforthetargettask.Inthisway,ANNpromotesre-usabilityoflearning,whichcanbequiteusefulwhenthetargetapplicationhasasmallernumberoflabeledtraininginstances.

2. Deeplearningtechniquesforregularization,suchasthedropoutmethod,helpinreducingthemodelcomplexityofANNandthuspromotinggoodgeneralizationperformance.Theuseofregularizationtechniquesisespeciallyusefulinhigh-dimensionalsettings,wherethenumberoftraininglabelsissmallbuttheclassificationproblemisinherentlydifficult.

3. Theuseofanautoencoderforpretrainingcanhelpeliminateirrelevantattributesthatarenotrelatedtootherattributes.Further,itcanhelpreducetheimpactofredundantattributesbyrepresentingthemascopiesofthesameattribute.

4. AlthoughthelearningofanANNmodelcansuccumbtofindinginferiorandlocallyoptimalsolutions,thereareanumberofdeeplearningtechniquesthathavebeenproposedtoensureadequatelearningofanANN.Apartfromthemethodsdiscussedinthissection,someothertechniquesinvolvenovelarchitecturedesignssuchasskipconnectionsbetweentheoutputlayerandlowerlayers,whichaidstheeasyflowofgradientsduringbackpropagation.

5. AnumberofspecializedANNarchitectureshavebeendesignedtohandleavarietyofinputdatasets.Someexamplesincludeconvolutionalneuralnetworks(CNN)fortwo-dimensionalgriddedobjectssuchasimages,andrecurrentneuralnetwork(RNN)forsequences.WhileCNNshavebeenextensivelyusedintheareaofcomputervision,RNNshavefoundapplicationsinprocessingspeechandlanguage.

4.9SupportVectorMachine(SVM)Asupportvectormachine(SVM)isadiscriminativeclassificationmodelthatlearnslinearornonlineardecisionboundariesintheattributespacetoseparatetheclasses.Apartfrommaximizingtheseparabilityofthetwoclasses,SVMoffersstrongregularizationcapabilities,i.e.,itisabletocontrolthecomplexityofthemodelinordertoensuregoodgeneralizationperformance.Duetoitsuniqueabilitytoinnatelyregularizeitslearning,SVMisabletolearnhighlyexpressivemodelswithoutsufferingfromoverfitting.Ithasthusreceivedconsiderableattentioninthemachinelearningcommunityandiscommonlyusedinseveralpracticalapplications,rangingfromhandwrittendigitrecognitiontotextcategorization.SVMhasstrongrootsinstatisticallearningtheoryandisbasedontheprincipleofstructuralriskminimization.AnotheruniqueaspectofSVMisthatitrepresentsthedecisionboundaryusingonlyasubsetofthetrainingexamplesthataremostdifficulttoclassify,knownasthesupportvectors.Hence,itisadiscriminativemodelthatisimpactedonlybytraininginstancesneartheboundaryofthetwoclasses,incontrasttolearningthegenerativedistributionofeveryclass.

ToillustratethebasicideabehindSVM,wefirstintroducetheconceptofthemarginofaseparatinghyperplaneandtherationaleforchoosingsuchahyperplanewithmaximummargin.WethendescribehowalinearSVMcanbetrainedtoexplicitlylookforthistypeofhyperplane.WeconcludebyshowinghowtheSVMmethodologycanbeextendedtolearnnonlineardecisionboundariesbyusingkernelfunctions.

4.9.1 Margin of a Separating Hyperplane

The generic equation of a separating hyperplane can be written as

$$\mathbf{w}^T\mathbf{x} + b = 0,$$

where $\mathbf{x}$ represents the attributes and $(\mathbf{w}, b)$ represent the parameters of the hyperplane. A data instance $\mathbf{x}_i$ can belong to either side of the hyperplane depending on the sign of $(\mathbf{w}^T\mathbf{x}_i + b)$. For the purpose of binary classification, we are interested in finding a hyperplane that places instances of both classes on opposite sides of the hyperplane, thus resulting in a separation of the two classes. If there exists a hyperplane that can perfectly separate the classes in the data set, we say that the data set is linearly separable. Figure 4.32 shows an example of linearly separable data involving two classes, squares and circles. Note that there can be infinitely many hyperplanes that can separate the classes, two of which are shown in Figure 4.32 as lines $B_1$ and $B_2$. Even though every such hyperplane will have zero training error, they can provide different results on previously unseen instances. Which separating hyperplane should we thus finally choose to obtain the best generalization performance? Ideally, we would like to choose a simple hyperplane that is robust to small perturbations. This can be achieved by using the concept of the margin of a separating hyperplane, which can be briefly described as follows.

Figure 4.32. Margin of a hyperplane in a two-dimensional data set.

For every separating hyperplane $B_i$, let us associate a pair of parallel hyperplanes, $b_{i1}$ and $b_{i2}$, such that they touch the closest instances of both classes, respectively. For example, if we move $B_1$ parallel to its direction, we can touch the first square using $b_{11}$ and the first circle using $b_{12}$. $b_{i1}$ and $b_{i2}$ are known as the margin hyperplanes of $B_i$, and the distance between them is known as the margin of the separating hyperplane $B_i$. From the diagram shown in Figure 4.32, notice that the margin for $B_1$ is considerably larger than that for $B_2$. In this example, $B_1$ turns out to be the separating hyperplane with the maximum margin, known as the maximum margin hyperplane.

Rationale for Maximum Margin

Hyperplaneswithlargemarginstendtohavebettergeneralizationperformancethanthosewithsmallmargins.Intuitively,ifthemarginissmall,thenanyslightperturbationinthehyperplaneorthetraininginstanceslocatedattheboundarycanhavequiteanimpactontheclassificationperformance.Smallmarginhyperplanesarethusmoresusceptibletooverfitting,astheyarebarelyabletoseparatetheclasseswithaverynarrowroomtoallowperturbations.Ontheotherhand,ahyperplanethatisfartherawayfromtraininginstancesofbothclasseshassufficientleewaytoberobusttominormodificationsinthedata,andthusshowssuperiorgeneralizationperformance.

Theideaofchoosingthemaximummarginseparatinghyperplanealsohasstrongfoundationsinstatisticallearningtheory.ItcanbeshownthatthemarginofsuchahyperplaneisinverselyrelatedtotheVC-dimensionoftheclassifier,whichisacommonlyusedmeasureofthecomplexityofamodel.AsdiscussedinSection3.4 ofthelastchapter,asimplermodelshouldbepreferredoveramorecomplexmodeliftheybothshowsimilartrainingperformance.Hence,maximizingthemarginresultsintheselectionofaseparatinghyperplanewiththelowestmodelcomplexity,whichisexpectedtoshowbettergeneralizationperformance.

4.9.2LinearSVM

AlinearSVMisaclassifierthatsearchesforaseparatinghyperplanewiththelargestmargin,whichiswhyitisoftenknownasamaximalmarginclassifier.ThebasicideaofSVMcanbedescribedasfollows.

Consider a binary classification problem consisting of $n$ training instances, where every training instance $\mathbf{x}_i$ is associated with a binary label $y_i \in \{-1, 1\}$. Let $\mathbf{w}^T\mathbf{x} + b = 0$ be the equation of a separating hyperplane that separates the two classes by placing them on opposite sides. This means that

$$\mathbf{w}^T\mathbf{x}_i + b > 0 \text{ if } y_i = 1, \qquad \mathbf{w}^T\mathbf{x}_i + b < 0 \text{ if } y_i = -1.$$

The distance of any point $\mathbf{x}$ from the hyperplane is then given by

$$D(\mathbf{x}) = \frac{|\mathbf{w}^T\mathbf{x} + b|}{\|\mathbf{w}\|},$$

where $|\cdot|$ denotes the absolute value and $\|\cdot\|$ denotes the length of a vector. Let the distance of the closest point from the hyperplane with $y = 1$ be $k_+ > 0$. Similarly, let $k_- > 0$ denote the distance of the closest point from class $-1$. This can be represented using the following constraints:

$$\frac{\mathbf{w}^T\mathbf{x}_i + b}{\|\mathbf{w}\|} \ge k_+ \text{ if } y_i = 1, \qquad \frac{\mathbf{w}^T\mathbf{x}_i + b}{\|\mathbf{w}\|} \le -k_- \text{ if } y_i = -1. \quad (4.69)$$

The previous equations can be succinctly represented by using the product of $y_i$ and $(\mathbf{w}^T\mathbf{x}_i + b)$ as

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge M\|\mathbf{w}\|, \quad (4.70)$$

where $M$ is a parameter related to the margin of the hyperplane, i.e., if $k_+ = k_- = M$, then the margin $= k_+ + k_- = 2M$. In order to find the maximum margin hyperplane that adheres to the previous constraints, we can consider the following optimization problem:

$$\max_{\mathbf{w}, b} \; M \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge M\|\mathbf{w}\|. \quad (4.71)$$

To find the solution to the previous problem, note that if $\mathbf{w}$ and $b$ satisfy the constraints of the previous problem, then any scaled version of $\mathbf{w}$ and $b$ would satisfy them too. Hence, we can conveniently choose $\|\mathbf{w}\| = 1/M$ to simplify the right-hand side of the inequalities. Furthermore, maximizing $M$ amounts to minimizing $\|\mathbf{w}\|^2$. Hence, the optimization problem of SVM is commonly represented in the following form:

$$\min_{\mathbf{w}, b} \; \frac{\|\mathbf{w}\|^2}{2} \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1. \quad (4.72)$$

Learning Model Parameters
Equation 4.72 represents a constrained optimization problem with linear inequalities. Since the objective function is convex and quadratic with respect to $\mathbf{w}$, it is known as a quadratic programming problem (QPP), which can be solved using standard optimization techniques, as described in Appendix E. In the following, we present a brief sketch of the main ideas for learning the model parameters of SVM.

First, we rewrite the objective function in a form that takes into account the constraints imposed on its solutions. The new objective function is known as the Lagrangian primal problem, which can be represented as follows,

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \lambda_i \big( y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \big), \quad (4.73)$$

where the parameters $\lambda_i \ge 0$ correspond to the constraints and are called the Lagrange multipliers. Next, to minimize the Lagrangian, we take the derivatives of $L_P$ with respect to $\mathbf{w}$ and $b$ and set them equal to zero:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i, \quad (4.74)$$

$$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \lambda_i y_i = 0. \quad (4.75)$$

Note that using Equation 4.74, we can represent $\mathbf{w}$ completely in terms of the Lagrange multipliers. There is another relationship between $(\mathbf{w}, b)$ and $\lambda_i$ that is derived from the Karush-Kuhn-Tucker (KKT) conditions, a commonly used technique for solving QPP. This relationship can be described as

$$\lambda_i \big[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \big] = 0. \quad (4.76)$$

Equation 4.76 is known as the complementary slackness condition, which sheds light on a valuable property of SVM. It states that the Lagrange multiplier $\lambda_i$ is strictly greater than 0 only when $\mathbf{x}_i$ satisfies the equation $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$, which means that $\mathbf{x}_i$ lies exactly on a margin hyperplane. However, if $\mathbf{x}_i$ is farther away from the margin hyperplanes such that $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 1$, then $\lambda_i$ is necessarily 0. Hence, $\lambda_i > 0$ for only a small number of instances that are closest to the separating hyperplane, which are known as support vectors. Figure 4.33 shows the support vectors of a hyperplane as filled circles and squares. Further, if we look at Equation 4.74, we will observe that training instances with $\lambda_i = 0$ do not contribute to the weight parameter $\mathbf{w}$. This suggests that $\mathbf{w}$ can be concisely represented only in terms of the support vectors in the training data, which are quite fewer than the overall number of training instances. This ability to represent the decision function only in terms of the support vectors is what gives this classifier the name support vector machines.

Figure 4.33. Support vectors of a hyperplane shown as filled circles and squares.

Using Equations 4.74, 4.75, and 4.76 in Equation 4.73, we obtain the following optimization problem in terms of the Lagrange multipliers $\lambda_i$:

$$\max_{\lambda_i} \; \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \quad \text{subject to} \quad \sum_{i=1}^{n} \lambda_i y_i = 0, \;\; \lambda_i \ge 0. \quad (4.77)$$

The previous optimization problem is called the dual optimization problem. Maximizing the dual problem with respect to $\lambda_i$ is equivalent to minimizing the primal problem with respect to $\mathbf{w}$ and $b$.

The key differences between the dual and primal problems are as follows:

1. Solving the dual problem helps us identify the support vectors in the data, which have non-zero values of $\lambda_i$. Further, the solution of the dual problem is influenced only by the support vectors that are closest to the decision boundary of SVM. This helps in summarizing the learning of SVM solely in terms of its support vectors, which are easier to manage computationally. Further, it represents a unique ability of SVM to be dependent only on the instances closest to the boundary, which are harder to classify, rather than the distribution of instances farther away from the boundary.

2. The objective of the dual problem involves only terms of the form $\mathbf{x}_i^T\mathbf{x}_j$, which are basically inner products in the attribute space. As we will see later in Section 4.9.4, this property will prove to be quite useful in learning nonlinear decision boundaries using SVM.

Because of these differences, it is useful to solve the dual optimization problem using any of the standard solvers for QPP. Having found an optimal solution for $\lambda_i$, we can use Equation 4.74 to solve for $\mathbf{w}$. We can then use Equation 4.76 on the support vectors to solve for $b$ as follows:

$$b = \frac{1}{n_S} \sum_{i \in S} \frac{1 - y_i \mathbf{w}^T\mathbf{x}_i}{y_i}, \quad (4.78)$$

where $S$ represents the set of support vectors $(S = \{i \,|\, \lambda_i > 0\})$ and $n_S$ is the number of support vectors. The maximum margin hyperplane can then be expressed as

$$f(\mathbf{x}) = \Big( \sum_{i=1}^{n} \lambda_i y_i \mathbf{x}_i^T\mathbf{x} \Big) + b = 0. \quad (4.79)$$

Using this separating hyperplane, a test instance $\mathbf{x}$ can be assigned a class label using the sign of $f(\mathbf{x})$.

Example 4.7. Consider the two-dimensional data set shown in Figure 4.34, which contains eight training instances. Using quadratic programming, we can solve the optimization problem stated in Equation 4.77 to obtain the Lagrange multiplier $\lambda_i$ for each training instance. The Lagrange multipliers are depicted in the last column of the table. Notice that only the first two instances have non-zero Lagrange multipliers. These instances correspond to the support vectors for this data set.

Let $\mathbf{w} = (w_1, w_2)$ and $b$ denote the parameters of the decision boundary. Using Equation 4.74, we can solve for $w_1$ and $w_2$ in the following way:

$$w_1 = \sum_i \lambda_i y_i x_{i1} = 65.5261 \times 1 \times 0.3858 + 65.5261 \times (-1) \times 0.4871 = -6.64,$$
$$w_2 = \sum_i \lambda_i y_i x_{i2} = 65.5261 \times 1 \times 0.4687 + 65.5261 \times (-1) \times 0.6110 = -9.32.$$

Figure 4.34. Example of a linearly separable data set.

The bias term $b$ can be computed using Equation 4.76 for each support vector:

$$b^{(1)} = 1 - \mathbf{w} \cdot \mathbf{x}_1 = 1 - (-6.64)(0.3858) - (-9.32)(0.4687) = 7.9300,$$
$$b^{(2)} = -1 - \mathbf{w} \cdot \mathbf{x}_2 = -1 - (-6.64)(0.4871) - (-9.32)(0.6110) = 7.9289.$$

Averaging these values, we obtain $b = 7.93$. The decision boundary corresponding to these parameters is shown in Figure 4.34.
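The arithmetic in this example can be reproduced with a few lines of NumPy, using only the two support vectors:

```python
import numpy as np

# Support vectors and Lagrange multipliers from Example 4.7 (only the two
# instances with non-zero lambda matter).
lam = np.array([65.5261, 65.5261])
y = np.array([1, -1])
X = np.array([[0.3858, 0.4687],
              [0.4871, 0.6110]])

# Eq. 4.74: w = sum_i lambda_i * y_i * x_i
w = (lam * y) @ X
# Eq. 4.76 on each support vector: b = y_i - w^T x_i (since y_i is +1 or -1), then average.
b = np.mean(y - X @ w)

print("w =", np.round(w, 2))   # approximately [-6.64, -9.32]
print("b =", np.round(b, 2))   # approximately 7.93
```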

4.9.3 Soft-margin SVM

Figure 4.35 shows a data set that is similar to Figure 4.32, except it has two new examples, P and Q. Although the decision boundary $B_1$ misclassifies the new examples, while $B_2$ classifies them correctly, this does not mean that $B_2$ is a better decision boundary than $B_1$, because the new examples may correspond to noise in the training data. $B_1$ should still be preferred over $B_2$ because it has a wider margin, and thus, is less susceptible to overfitting. However, the SVM formulation presented in the previous section only constructs decision boundaries that are mistake-free.

Figure4.35.DecisionboundaryofSVMforthenon-separablecase.

ThissectionexamineshowtheformulationofSVMcanbemodifiedtolearnaseparatinghyperplanethatistolerabletosmallnumberoftrainingerrorsusingamethodknownasthesoft-marginapproach.Moreimportantly,themethodpresentedinthissectionallowsSVMtolearnlinearhyperplaneseveninsituationswheretheclassesarenotlinearlyseparable.Todothis,thelearningalgorithminSVMmustconsiderthetrade-offbetweenthewidthofthemarginandthenumberoftrainingerrorscommittedbythelinearhyperplane.

To introduce the concept of training errors in the SVM formulation, let us relax the inequality constraints to accommodate some violations on a small number of training instances. This can be done by introducing a slack variable $\xi_i \ge 0$ for every training instance $\mathbf{x}_i$ as follows:

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i. \quad (4.80)$$

The variable $\xi_i$ allows for some slack in the inequalities of the SVM such that every instance $\mathbf{x}_i$ does not need to strictly satisfy $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$. Further, $\xi_i$ is non-zero only if the margin hyperplanes are not able to place $\mathbf{x}_i$ on the same side as the rest of the instances belonging to $y_i$. To illustrate this, Figure 4.36 shows a circle P that falls on the opposite side of the separating hyperplane as the rest of the circles, and thus satisfies $\mathbf{w}^T\mathbf{x} + b = -1 + \xi$. The distance between P and the margin hyperplane $\mathbf{w}^T\mathbf{x} + b = -1$ is equal to $\xi/\|\mathbf{w}\|$. Hence, $\xi_i$ provides a measure of the error of SVM in representing $\mathbf{x}_i$ using soft inequality constraints.

Figure 4.36. Slack variables used in soft-margin SVM.

In the presence of slack variables, it is important to learn a separating hyperplane that jointly maximizes the margin (ensuring good generalization performance) and minimizes the values of the slack variables (ensuring low training error). This can be achieved by modifying the optimization problem of SVM as follows:

$$\min_{\mathbf{w}, b, \xi_i} \; \frac{\|\mathbf{w}\|^2}{2} + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \quad (4.81)$$

where $C$ is a hyper-parameter that makes a trade-off between maximizing the margin and minimizing the training error. A large value of $C$ puts more emphasis on minimizing the training error than on maximizing the margin. Notice the similarity of the previous equation with the generic formula of generalization error rate introduced in Section 3.4 of the previous chapter. Indeed, SVM provides a natural way to balance between model complexity and training error in order to maximize generalization performance.

To solve Equation 4.81, we apply the Lagrange multiplier method and convert the primal problem to its corresponding dual problem, similar to the approach described in the previous section. The Lagrangian primal problem of Equation 4.81 can be written as follows:

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\lambda_i\big(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\big) - \sum_{i=1}^{n}\mu_i\,\xi_i, \quad (4.82)$$

where $\lambda_i \ge 0$ and $\mu_i \ge 0$ are the Lagrange multipliers corresponding to the inequality constraints of Equation 4.81. Setting the derivatives of $L_P$ with respect to $\mathbf{w}$, $b$, and $\xi_i$ equal to 0, we obtain the following equations:

$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n}\lambda_i y_i \mathbf{x}_i, \quad (4.83)$$

$$\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\lambda_i y_i = 0, \quad (4.84)$$

$$\frac{\partial L_P}{\partial \xi_i} = 0 \;\Rightarrow\; \lambda_i + \mu_i = C. \quad (4.85)$$

We can also obtain the complementary slackness conditions by using the following KKT conditions:

$$\lambda_i\big(y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\big) = 0, \quad (4.86)$$

$$\mu_i\,\xi_i = 0. \quad (4.87)$$

Equation 4.86 suggests that $\lambda_i$ is zero for all training instances except those that reside on the margin hyperplanes $\mathbf{w}^T\mathbf{x}_i + b = \pm 1$, or have $\xi_i > 0$. These instances with $\lambda_i > 0$ are known as support vectors. On the other hand, $\mu_i$ given in Equation 4.87 is zero for any training instance that is misclassified, i.e., $\xi_i > 0$. Further, $\lambda_i$ and $\mu_i$ are related with each other by Equation 4.85. This results in the following three configurations of $(\lambda_i, \mu_i)$:

1. If $\lambda_i = 0$ and $\mu_i = C$, then $\mathbf{x}_i$ does not reside on the margin hyperplanes and is correctly classified on the same side as other instances belonging to $y_i$.

2. If $\lambda_i = C$ and $\mu_i = 0$, then $\mathbf{x}_i$ is misclassified and has a non-zero slack variable $\xi_i$.

3. If $0 < \lambda_i < C$ and $0 < \mu_i < C$, then $\mathbf{x}_i$ resides on one of the margin hyperplanes.

Substituting Equations 4.83 to 4.87 into Equation 4.82, we obtain the following dual optimization problem:

$$\max_{\lambda_i} \; \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \quad \text{subject to} \quad \sum_{i=1}^{n}\lambda_i y_i = 0, \;\; 0 \le \lambda_i \le C. \quad (4.88)$$

Notice that the previous problem looks almost identical to the dual problem of SVM for the linearly separable case (Equation 4.77), except that $\lambda_i$ is required to not only be greater than 0 but also smaller than a constant value $C$. Clearly, when $C$ reaches infinity, the previous optimization problem becomes equivalent to Equation 4.77, where the learned hyperplane perfectly separates the classes (with no training errors). However, by capping the values of $\lambda_i$ at $C$, the learned hyperplane is able to tolerate a few training errors that have $\xi_i > 0$.

Figure 4.37. Hinge loss as a function of $y\hat{y}$.

As before, Equation 4.88 can be solved by using any of the standard solvers for QPP, and the optimal value of $\mathbf{w}$ can be obtained by using Equation 4.83. To solve for $b$, we can use Equation 4.86 on the support vectors that reside on the margin hyperplanes as follows:

$$b = \frac{1}{n_S}\sum_{i \in S}\frac{1 - y_i\mathbf{w}^T\mathbf{x}_i}{y_i}, \quad (4.89)$$

where $S$ represents the set of support vectors residing on the margin hyperplanes $(S = \{i \,|\, 0 < \lambda_i < C\})$ and $n_S$ is the number of elements in $S$.

SVM as a Regularizer of Hinge Loss
SVM belongs to a broad class of regularization techniques that use a loss function to represent the training errors and a norm of the model parameters to represent the model complexity. To realize this, notice that the slack variable $\xi$, used for measuring the training errors in SVM, is equivalent to the hinge loss function, which can be defined as follows:

$$\mathrm{Loss}(y, \hat{y}) = \max(0, 1 - y\hat{y}),$$

where $y \in \{+1, -1\}$. In the case of SVM, $\hat{y}$ corresponds to $\mathbf{w}^T\mathbf{x} + b$. Figure 4.37 shows a plot of the hinge loss function as we vary $y\hat{y}$. We can see that the hinge loss is equal to 0 as long as $y$ and $\hat{y}$ have the same sign and $|\hat{y}| \ge 1$. However, the hinge loss grows linearly with $|\hat{y}|$ whenever $y$ and $\hat{y}$ are of the opposite sign, or $|\hat{y}| < 1$. This is similar to the notion of the slack variable $\xi$, which is used to measure the distance of a point from its margin hyperplane. Hence, the optimization problem of SVM can be represented in the following equivalent form:

$$\min_{\mathbf{w}, b} \; \frac{\|\mathbf{w}\|^2}{2} + C\sum_{i=1}^{n}\mathrm{Loss}(y_i, \mathbf{w}^T\mathbf{x}_i + b). \quad (4.90)$$

Note that using the hinge loss ensures that the optimization problem is convex and can be solved using standard optimization techniques. However, if we use a different loss function, such as the squared loss function that was introduced in Section 4.7 on ANN, it will result in a different optimization problem that may or may not remain convex. Nevertheless, different loss functions can be explored to capture varying notions of training error, depending on the characteristics of the problem.
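As a sketch of Equation 4.90, the following minimizes the hinge-loss objective by simple subgradient descent instead of the quadratic programming formulation used in the text; the toy two-class data set, learning rate, and number of passes are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy linearly separable data: class +1 around (2,2), class -1 around (-2,-2).
X = np.vstack([rng.normal(loc=2, size=(20, 2)), rng.normal(loc=-2, size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

C, lr = 1.0, 0.01
w, b = np.zeros(2), 0.0

# Subgradient descent on Eq. 4.90: ||w||^2 / 2 + C * sum_i max(0, 1 - y_i (w^T x_i + b)).
for epoch in range(500):
    margin = y * (X @ w + b)
    viol = margin < 1                                   # instances with non-zero hinge loss
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

hinge = np.maximum(0, 1 - y * (X @ w + b))
print("objective value:", 0.5 * w @ w + C * hinge.sum())
print("training errors:", int((np.sign(X @ w + b) != y).sum()))
```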

AnotherinterestingpropertyofSVMthatrelatesittoabroaderclassofregularizationtechniquesistheconceptofamargin.Althoughminimizinghasthegeometricinterpretationofmaximizingthemarginofaseparating

hyperplane,itisessentiallythesquared normofthemodelparameters,.Ingeneral,the normof , ,isequaltotheMinkowskidistanceof

orderqfrom totheorigin,i.e.,

Minimizingthe normof toachievelowermodelcomplexityisagenericregularizationconceptthathasseveralinterpretations.Forexample,minimizingthe normamountstofindingasolutiononahypersphereofsmallestradiusthatshowssuitabletrainingperformance.Tovisualizethisintwo-dimensions,Figure4.38(a) showstheplotofacirclewithconstantradiusr,whereeverypointhasthesame norm.Ontheotherhand,usingthe normensuresthatthesolutionliesonthesurfaceofahypercubewithsmallestsize,withverticesalongtheaxes.ThisisillustratedinFigure4.38(b) asasquarewithverticesontheaxesatadistanceofrfromtheorigin.The normiscommonlyusedasaregularizertoobtainsparsemodelparameterswithonlyasmallnumberofnon-zeroparametervalues,suchastheuseofLassoinregressionproblems(seeBibliographicNotes).

ǁwǁ2

L2 ǁwǁ22 Lq ǁwǁq

ǁwǁq=(∑ipwiq)1/q

Lq

L2

L2L1

L1

Figure4.38.Plotsshowingthebehavioroftwo-dimensionalsolutionswithconstant andnorms.

Ingeneral,dependingonthecharacteristicsoftheproblem,differentcombinationsof normsandtraininglossfunctionscanbeusedforlearningthemodelparameters,eachrequiringadifferentoptimizationsolver.Thisformsthebackboneofawiderangeofmodelingtechniquesthatattempttoimprovethegeneralizationperformancebyjointlyminimizingtrainingerrorandmodelcomplexity.However,inthissection,wefocusonlyonthesquarednormandthehingelossfunction,resultingintheclassicalformulationof

SVM.

4.9.4NonlinearSVM

L2L1

Lq

L2

TheSVMformulationsdescribedintheprevioussectionsconstructalineardecisionboundarytoseparatethetrainingexamplesintotheirrespectiveclasses.ThissectionpresentsamethodologyforapplyingSVMtodatasetsthathavenonlineardecisionboundaries.Thebasicideaistotransformthedatafromitsoriginalattributespacein intoanewspace sothatalinearhyperplanecanbeusedtoseparatetheinstancesinthetransformedspace,usingtheSVMapproach.Thelearnedhyperplanecanthenbeprojectedbacktotheoriginalattributespace,resultinginanonlineardecisionboundary.

Figure4.39.Classifyingdatawithanonlineardecisionboundary.

AttributeTransformationToillustratehowattributetransformationcanleadtoalineardecisionboundary,Figure4.39(a) showsanexampleofatwo-dimensionaldatasetconsistingofsquares(classifiedas )andcircles(classifiedas ).The

φ(x)

y=1 y=−1

datasetisgeneratedinsuchawaythatallthecirclesareclusterednearthecenterofthediagramandallthesquaresaredistributedfartherawayfromthecenter.Instancesofthedatasetcanbeclassifiedusingthefollowingequation:

Thedecisionboundaryforthedatacanthereforebewrittenasfollows:

whichcanbefurthersimplifiedintothefollowingquadraticequation:

Anonlineartransformation isneededtomapthedatafromitsoriginalattributespaceintoanewspacesuchthatalinearhyperplanecanseparatetheclasses.Thiscanbeachievedbyusingthefollowingsimpletransformation:

Figure4.39(b) showsthepointsinthetransformedspace,wherewecanseethatallthecirclesarelocatedinthelowerleft-handsideofthediagram.Alinearhyperplanewithparameters andbcanthereforebeconstructedinthetransformedspace,toseparatetheinstancesintotheirrespectiveclasses.

Onemaythinkthatbecausethenonlineartransformationpossiblyincreasesthedimensionalityoftheinputspace,thisapproachcansufferfromthecurseofdimensionalitythatisoftenassociatedwithhigh-dimensionaldata.

y={1if(x1−0.5)2+(x2−0.5)2>0.2,−1otherwise. (4.91)

(x1−0.5)2+(x2−0.5)2>0.2,

x12−x1+x22−x2=−0.46.

φ

φ:(x1,x2)→(x12−x1,x22−x2). (4.92)

However,aswewillseeinthefollowingsection,nonlinearSVMisabletoavoidthisproblembyusingkernelfunctions.

LearningaNonlinearSVMModelUsingasuitablefunction, ,wecantransformanydatainstance to .(Thedetailsonhowtochoose willbecomeclearlater.)Thelinearhyperplaneinthetransformedspacecanbeexpressedas .Tolearntheoptimalseparatinghyperplane,wecansubstitute for intheformulationofSVMtoobtainthefollowingoptimizationproblem:

UsingLagrangemultipliers ,thiscanbeconvertedintoadualoptimizationproblem:max

where denotestheinnerproductbetweenvectors and .Also,theequationofthehyperplaneinthetransformedspacecanberepresentedusing

asfollows:

Further,bisgivenby

φ(⋅) φ(x)φ(⋅)

wTφ(x)+b=0φ(x)

minw,b,ξiǁwǁ22+C∑i=1nξisubjecttoyi(wTφ(xi)+b)≥1−ξi,ξi≥0. (4.93)

λi

maxλi∑i=1nλi−12∑i=1n∑j=1nλiλjyiyj⟨φ(xi),φ(xj)⟩subjectto∑i=1nλjyi=0,0≤λi≤C,

(4.94)

⟨a,b⟩

λi

∑i=1nλiyi⟨φ(xi),φ(x)⟩+b=0. (4.95)

b=1nS(∑i∈S1yi−∑i∈S∑j=1nλjyiyj⟨φ(xi),φ(xj)⟩yi), (4.96)

where $S = \{i \,|\, 0 < \lambda_i < C\}$ is the set of support vectors residing on the margin hyperplanes and $n_S$ is the number of elements in $S$.

Note that in order to solve the dual optimization problem in Equation 4.94, or to use the learned model parameters to make predictions using Equations 4.95 and 4.96, we need only inner products of $\varphi(\mathbf{x})$. Hence, even though $\varphi(\mathbf{x})$ may be nonlinear and high-dimensional, it suffices to use a function of the inner products of $\varphi(\mathbf{x})$ in the transformed space. This can be achieved by using a kernel trick, which can be described as follows.

The inner product between two vectors is often regarded as a measure of similarity between the vectors. For example, the cosine similarity described in Section 2.4.5 on page 79 can be defined as the dot product between two vectors that are normalized to unit length. Analogously, the inner product $\langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j) \rangle$ can also be regarded as a measure of similarity between two instances, $\mathbf{x}_i$ and $\mathbf{x}_j$, in the transformed space. The kernel trick is a method for computing this similarity as a function of the original attributes. Specifically, the kernel function $K(\mathbf{u}, \mathbf{v})$ between two instances $\mathbf{u}$ and $\mathbf{v}$ can be defined as follows:

$$K(\mathbf{u}, \mathbf{v}) = \langle \varphi(\mathbf{u}), \varphi(\mathbf{v}) \rangle = f(\mathbf{u}, \mathbf{v}), \quad (4.97)$$

where $f(\cdot)$ is a function that follows certain conditions as stated by Mercer's Theorem. Although the details of this theorem are outside the scope of the book, we provide a list of some of the commonly used kernel functions:

Polynomial kernel: $K(\mathbf{u}, \mathbf{v}) = (\mathbf{u}^T\mathbf{v} + 1)^p$ (4.98)

Radial basis function kernel: $K(\mathbf{u}, \mathbf{v}) = e^{-\|\mathbf{u}-\mathbf{v}\|^2/(2\sigma^2)}$ (4.99)

Sigmoid kernel: $K(\mathbf{u}, \mathbf{v}) = \tanh(\kappa\,\mathbf{u}^T\mathbf{v} - \delta)$ (4.100)

Byusingakernelfunction,wecandirectlyworkwithinnerproductsinthetransformedspacewithoutdealingwiththeexactformsofthenonlineartransformationfunction .Specifically,thisallowsustousehigh-dimensionaltransformations(sometimeseveninvolvinginfinitelymanydimensions),whileperformingcalculationsonlyintheoriginalattributespace.Computingtheinnerproductsusingkernelfunctionsisalsoconsiderablycheaperthanusingthetransformedattributeset .Hence,theuseofkernelfunctionsprovidesasignificantadvantageinrepresentingnonlineardecisionboundaries,withoutsufferingfromthecurseofdimensionality.ThishasbeenoneofthemajorreasonsbehindthewidespreadusageofSVMinhighlycomplexandnonlinearproblems.
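As a sketch of the kernel trick, the following computes the kernel matrices of Equations 4.98 and 4.99 directly from the original attributes, which is all that Equations 4.94-4.96 require; the two instances and the hyper-parameter values are assumptions chosen for illustration:

```python
import numpy as np

def rbf_kernel(U, V, sigma=1.0):
    # Radial basis function kernel (Eq. 4.99): K(u, v) = exp(-||u - v||^2 / (2 sigma^2)).
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def polynomial_kernel(U, V, p=2):
    # Polynomial kernel (Eq. 4.98): K(u, v) = (u^T v + 1)^p.
    return (U @ V.T + 1) ** p

# Two hypothetical instances in the original two-dimensional attribute space.
X = np.array([[0.5, 0.5], [0.9, 0.1]])

# Each kernel matrix holds the inner products <phi(x_i), phi(x_j)> needed by the dual
# problem, computed without ever forming phi(x) explicitly.
print(np.round(rbf_kernel(X, X), 3))
print(polynomial_kernel(X, X))
```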

Figure4.40.DecisionboundaryproducedbyanonlinearSVMwithpolynomialkernel.

Figure4.40 showsthenonlineardecisionboundaryobtainedbySVMusingthepolynomialkernelfunctiongiveninEquation4.98 .Wecanseethatthe

φ

φ(x)

learneddecisionboundaryisquiteclosetothetruedecisionboundaryshowninFigure4.39(a) .Althoughthechoiceofkernelfunctiondependsonthecharacteristicsoftheinputdata,acommonlyusedkernelfunctionistheradialbasisfunction(RBF)kernel,whichinvolvesasinglehyper-parameter ,knownasthestandarddeviationoftheRBFkernel.

4.9.5CharacteristicsofSVM

1. TheSVMlearningproblemcanbeformulatedasaconvexoptimizationproblem,inwhichefficientalgorithmsareavailabletofindtheglobalminimumoftheobjectivefunction.Otherclassificationmethods,suchasrule-basedclassifiersandartificialneuralnetworks,employagreedystrategytosearchthehypothesisspace.Suchmethodstendtofindonlylocallyoptimumsolutions.

2. SVMprovidesaneffectivewayofregularizingthemodelparametersbymaximizingthemarginofthedecisionboundary.Furthermore,itisabletocreateabalancebetweenmodelcomplexityandtrainingerrorsbyusingahyper-parameterC.Thistrade-offisgenerictoabroaderclassofmodellearningtechniquesthatcapturethemodelcomplexityandthetraininglossusingdifferentformulations.

3. LinearSVMcanhandleirrelevantattributesbylearningzeroweightscorrespondingtosuchattributes.Itcanalsohandleredundantattributesbylearningsimilarweightsfortheduplicateattributes.Furthermore,theabilityofSVMtoregularizeitslearningmakesitmorerobusttothepresenceofalargenumberofirrelevantandredundantattributesthanotherclassifiers,eveninhigh-dimensionalsettings.Forthisreason,nonlinearSVMsarelessimpactedbyirrelevantandredundantattributesthanotherhighlyexpressiveclassifiersthatcanlearnnonlineardecisionboundariessuchasdecisiontrees.

σ

TocomparetheeffectofirrelevantattributesontheperformanceofnonlinearSVMsanddecisiontrees,considerthetwo-dimensionaldatasetshowninFigure4.41(a) containing and instances,wherethetwoclassescanbeeasilyseparatedusinganonlineardecisionboundary.Weincrementallyaddirrelevantattributestothisdatasetandcomparetheperformanceoftwoclassifiers:decisiontreeandnonlinearSVM(usingradialbasisfunctionkernel),using70%ofthedatafortrainingandtherestfortesting.Figure4.41(b) showsthetesterrorratesofthetwoclassifiersasweincreasethenumberofirrelevantattributes.Wecanseethatthetesterrorrateofdecisiontreesswiftlyreaches0.5(sameasrandomguessing)inthepresenceofevenasmallnumberofirrelevantattributes.ThiscanbeattributedtotheproblemofmultiplecomparisonswhilechoosingsplittingattributesatinternalnodesasdiscussedinExample3.7 ofthepreviouschapter.Ontheotherhand,nonlinearSVMshowsamorerobustandsteadyperformanceevenafteraddingamoderatelylargenumberofirrelevantattributes.Itstesterrorrategraduallydeclinesandeventuallyreachescloseto0.5afteradding125irrelevantattributes,atwhichpointitbecomesdifficulttodiscernthediscriminativeinformationintheoriginaltwoattributesfromthenoiseintheremainingattributesforlearningnonlineardecisionboundaries.

500+ 500o

Figure4.41.ComparingtheeffectofaddingirrelevantattributesontheperformanceofnonlinearSVMsanddecisiontrees.

4. SVMcanbeappliedtocategoricaldatabyintroducingdummyvariablesforeachcategoricalattributevaluepresentinthedata.Forexample,if hasthreevalues ,wecanintroduceabinaryvariableforeachoftheattributevalues.

5. TheSVMformulationpresentedinthischapterisforbinaryclassproblems.However,multiclassextensionsofSVMhavealsobeenproposed.

6. AlthoughthetrainingtimeofanSVMmodelcanbelarge,thelearnedparameterscanbesuccinctlyrepresentedwiththehelpofasmallnumberofsupportvectors,makingtheclassificationoftestinstancesquitefast.

{Single,Married,Divorced}

4.10 Ensemble Methods

This section presents techniques for improving classification accuracy by aggregating the predictions of multiple classifiers. These techniques are known as ensemble or classifier combination methods. An ensemble method constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier. This section explains why ensemble methods tend to perform better than any single classifier and presents techniques for constructing the classifier ensemble.

4.10.1 Rationale for Ensemble Method

The following example illustrates how an ensemble method can improve a classifier's performance.

Example 4.8. Consider an ensemble of 25 binary classifiers, each of which has an error rate of ε = 0.35. The ensemble classifier predicts the class label of a test example by taking a majority vote on the predictions made by the base classifiers. If the base classifiers are identical, then all the base classifiers will commit the same mistakes. Thus, the error rate of the ensemble remains 0.35. On the other hand, if the base classifiers are independent—i.e., their errors are uncorrelated—then the ensemble makes a wrong prediction only if more than half of the base classifiers predict incorrectly. In this case, the error rate of the ensemble classifier is

  e_ensemble = Σ_{i=13}^{25} \binom{25}{i} ε^i (1 − ε)^{25−i} = 0.06,        (4.101)

which is considerably lower than the error rate of the base classifiers.
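A minimal sketch of how Equation 4.101 can be evaluated (the function name is ours) follows.

from math import comb

def ensemble_error(eps, n=25):
    # The ensemble errs only if more than half (>= 13 of 25) of the base classifiers err.
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(n // 2 + 1, n + 1))

print(round(ensemble_error(0.35), 2))   # 0.06, as in Equation 4.101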

Figure 4.42 shows the error rate of an ensemble of 25 binary classifiers (e_ensemble) for different base classifier error rates (ε). The diagonal line represents the case in which the base classifiers are identical, while the solid line represents the case in which the base classifiers are independent. Observe that the ensemble classifier performs worse than the base classifiers when ε is larger than 0.5.

The preceding example illustrates two necessary conditions for an ensemble classifier to perform better than a single classifier: (1) the base classifiers should be independent of each other, and (2) the base classifiers should do better than a classifier that performs random guessing. In practice, it is difficult to ensure total independence among the base classifiers. Nevertheless, improvements in classification accuracies have been observed in ensemble methods in which the base classifiers are somewhat correlated.

4.10.2 Methods for Constructing an Ensemble Classifier

A logical view of the ensemble method is presented in Figure 4.43. The basic idea is to construct multiple classifiers from the original data and then aggregate their predictions when classifying unknown examples. The ensemble of classifiers can be constructed in many ways:


1. By manipulating the training set. In this approach, multiple training sets are created by resampling the original data according to some sampling distribution and constructing a classifier from each training set. The sampling distribution determines how likely it is that an example will be selected for training, and it may vary from one trial to another. Bagging and boosting are two examples of ensemble methods that manipulate their training sets. These methods are described in further detail in Sections 4.10.4 and 4.10.5.

Figure 4.42. Comparison between errors of base classifiers and errors of the ensemble classifier.

Figure 4.43. A logical view of the ensemble learning method.

2. By manipulating the input features. In this approach, a subset of input features is chosen to form each training set. The subset can be either chosen randomly or based on the recommendation of domain experts. Some studies have shown that this approach works very well with data sets that contain highly redundant features. Random forest, which is described in Section 4.10.6, is an ensemble method that manipulates its input features and uses decision trees as its base classifiers.

3. By manipulating the class labels. This method can be used when the number of classes is sufficiently large. The training data is transformed into a binary class problem by randomly partitioning the class labels into two disjoint subsets, A0 and A1. Training examples whose class label belongs to the subset A0 are assigned to class 0, while those that belong to the subset A1 are assigned to class 1. The relabeled examples are then used to train a base classifier. By repeating this process multiple times, an ensemble of base classifiers is obtained. When a test example is presented, each base classifier Ci is used to predict its class label. If the test example is predicted as class 0, then all the classes that belong to A0 will receive a vote. Conversely, if it is predicted to be class 1, then all the classes that belong to A1 will receive a vote. The votes are tallied and the class that receives the highest vote is assigned to the test example. An example of this approach is the error-correcting output coding method described on page 331.

4. By manipulating the learning algorithm. Many learning algorithms can be manipulated in such a way that applying the algorithm several times on the same training data will result in the construction of different classifiers. For example, an artificial neural network can change its network topology or the initial weights of the links between neurons. Similarly, an ensemble of decision trees can be constructed by injecting randomness into the tree-growing procedure. For example, instead of choosing the best splitting attribute at each node, we can randomly choose one of the top k attributes for splitting.

The first three approaches are generic methods that are applicable to any classifier, whereas the fourth approach depends on the type of classifier used. The base classifiers for most of these approaches can be generated sequentially (one after another) or in parallel (all at once). Once an ensemble of classifiers has been learned, a test example x is classified by combining the predictions made by the base classifiers Ci(x):

  C*(x) = f(C1(x), C2(x), …, Ck(x)),

where f is the function that combines the ensemble responses. One simple approach for obtaining C*(x) is to take a majority vote of the individual predictions. An alternate approach is to take a weighted majority vote, where the weight of a base classifier denotes its accuracy or relevance.
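A minimal sketch of both combination functions f, assuming a binary problem with labels in {−1, +1} and predictions already collected from k base classifiers, is shown below.

import numpy as np

def majority_vote(predictions):
    # predictions: array of shape (k, n) holding the k base classifiers' outputs
    return np.sign(np.sum(predictions, axis=0))

def weighted_majority_vote(predictions, weights):
    # weights: one nonnegative weight per base classifier (e.g., its accuracy)
    return np.sign(np.average(predictions, axis=0, weights=weights))

preds = np.array([[1, 1, -1, -1],
                  [1, -1, -1, 1],
                  [1, 1, 1, -1]])
print(majority_vote(preds))
print(weighted_majority_vote(preds, [0.9, 0.6, 0.8]))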

Ensemble methods show the most improvement when used with unstable classifiers, i.e., base classifiers that are sensitive to minor perturbations in the training set, because of high model complexity. Although unstable classifiers may have a low bias in finding the optimal decision boundary, their predictions have a high variance for minor changes in the training set or model selection. This trade-off between bias and variance is discussed in detail in the next section. By aggregating the responses of multiple unstable classifiers, ensemble learning attempts to minimize their variance without worsening their bias.

4.10.3 Bias-Variance Decomposition

Bias-variance decomposition is a formal method for analyzing the generalization error of a predictive model. Although the analysis is slightly different for classification than regression, we first discuss the basic intuition of this decomposition by using an analogue of a regression problem.

Consider the illustrative task of reaching a target y by firing projectiles from a starting position x, as shown in Figure 4.44. The target corresponds to the desired output at a test instance, while the starting position corresponds to its observed attributes. In this analogy, the projectile represents the model used for predicting the target using the observed attributes. Let ŷ denote the point where the projectile hits the ground, which is analogous to the prediction of the model.

Figure 4.44. Bias-variance decomposition.

Ideally, we would like our predictions to be as close to the true target as possible. However, note that different trajectories of projectiles are possible based on differences in the training data or in the approach used for model selection. Hence, we can observe a variance in the predictions ŷ over different runs of the projectile. Further, the target in our example is not fixed but has some freedom to move around, resulting in a noise component in the true target. This can be understood as the non-deterministic nature of the output variable, where the same set of attributes can have different output values. Let ŷ_avg represent the average prediction of the projectile over multiple runs, and y_avg denote the average target value. The difference between ŷ_avg and y_avg is known as the bias of the model.

In the context of classification, it can be shown that the generalization error of a classification model m can be decomposed into terms involving the bias, variance, and noise components of the model in the following way:

  gen.error(m) = c1 × noise + bias(m) + c2 × variance(m),

where c1 and c2 are constants that depend on the characteristics of training and test sets. Note that while the noise term is intrinsic to the target class, the bias and variance terms depend on the choice of the classification model. The bias of a model represents how close the average prediction of the model is to the average target. Models that are able to learn complex decision boundaries, e.g., models produced by k-nearest neighbor and multi-layer ANN, generally show low bias. The variance of a model captures the stability of its predictions in response to minor perturbations in the training set or the model selection approach.

We can say that a model shows better generalization performance if it has a lower bias and lower variance. However, if the complexity of a model is high but the training size is small, we generally expect to see a lower bias but higher variance, resulting in the phenomenon of overfitting. This phenomenon is pictorially represented in Figure 4.45(a). On the other hand, an overly simplistic model that suffers from underfitting may show a lower variance but would suffer from a high bias, as shown in Figure 4.45(b). Hence, the trade-off between bias and variance provides a useful way for interpreting the effects of underfitting and overfitting on the generalization performance of a model.

Figure 4.45. Plots showing the behavior of two-dimensional solutions with constant L2 and L1 norms.

The bias-variance trade-off can be used to explain why ensemble learning improves the generalization performance of unstable classifiers. If a base classifier shows low bias but high variance, it can become susceptible to overfitting, as even a small change in the training set will result in different predictions. However, by combining the responses of multiple base classifiers, we can expect to reduce the overall variance. Hence, ensemble learning methods show better performance primarily by lowering the variance in the predictions, although they can even help in reducing the bias. One of the simplest approaches for combining predictions and reducing their variance is to compute their average. This forms the basis of the bagging method, described in the following subsection.
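One way to make the variance term concrete is to retrain a model on many perturbed (bootstrap) training sets and measure how much its prediction at a fixed test instance fluctuates. The following minimal sketch does this for a decision stump and an unpruned decision tree; the data set and test point are illustrative assumptions, not the book's example.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)           # a simple, known decision boundary
x_test = np.array([[0.52, 0.49]])                 # a fixed test instance near the boundary

def prediction_variance(max_depth, n_runs=200):
    preds = []
    for _ in range(n_runs):
        idx = rng.randint(0, len(X), len(X))      # bootstrap sample of the training set
        model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        preds.append(model.fit(X[idx], y[idx]).predict(x_test)[0])
    return np.var(preds)

print("prediction variance, depth-1 stump :", prediction_variance(1))
print("prediction variance, unpruned tree :", prediction_variance(None))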


4.10.4 Bagging

Bagging, which is also known as bootstrap aggregating, is a technique that repeatedly samples (with replacement) from a data set according to a uniform probability distribution. Each bootstrap sample has the same size as the original data. Because the sampling is done with replacement, some instances may appear several times in the same training set, while others may be omitted from the training set. On average, a bootstrap sample D_i contains approximately 63% of the original training data because each sample has a probability 1 − (1 − 1/N)^N of being selected in each D_i. If N is sufficiently large, this probability converges to 1 − 1/e ≈ 0.632. The basic procedure for bagging is summarized in Algorithm 4.5. After training the k classifiers, a test instance is assigned to the class that receives the highest number of votes.

To illustrate how bagging works, consider the data set shown in Table 4.4. Let x denote a one-dimensional attribute and y denote the class label. Suppose we use only one-level binary decision trees, with a test condition x ≤ k, where k is a split point chosen to minimize the entropy of the leaf nodes. Such a tree is also known as a decision stump.

Table 4.4. Example of data set used to construct an ensemble of bagging classifiers.

x:  0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
y:    1     1     1    −1    −1    −1    −1     1     1     1

Without bagging, the best decision stump we can produce splits the instances at either x ≤ 0.35 or x ≤ 0.75. Either way, the accuracy of the tree is at most 70%. Suppose we apply the bagging procedure on the data set using 10 bootstrap samples. The examples chosen for training in each bagging round are shown


in Figure 4.46. On the right-hand side of each table, we also describe the decision stump being used in each round.

We classify the entire data set given in Table 4.4 by taking a majority vote among the predictions made by each base classifier. The results of the predictions are shown in Figure 4.47. Since the class labels are either −1 or +1, taking the majority vote is equivalent to summing up the predicted values of y and examining the sign of the resulting sum (refer to the second to last row in Figure 4.47). Notice that the ensemble classifier perfectly classifies all 10 examples in the original data.

Algorithm 4.5 Bagging algorithm.

Figure 4.46. Example of bagging.

The preceding example illustrates another advantage of using ensemble methods in terms of enhancing the representation of the target function. Even though each base classifier is a decision stump, combining the classifiers can lead to a decision boundary that mimics a decision tree of depth 2.

Bagging improves generalization error by reducing the variance of the base classifiers. The performance of bagging depends on the stability of the base classifier. If a base classifier is unstable, bagging helps to reduce the errors associated with random fluctuations in the training data. If a base classifier is stable, i.e., robust to minor perturbations in the training set, then the error of the ensemble is primarily caused by bias in the base classifier. In this situation, bagging may not be able to improve the performance of the base classifiers significantly. It may even degrade the classifier's performance because the effective size of each training set is about 37% smaller than the original data.
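The following is a minimal sketch of the bagging procedure on the one-dimensional data of Table 4.4 (labels as reconstructed above), using scikit-learn decision stumps as base classifiers. The random seed and exact bootstrap draws are arbitrary, so the result may differ from the rounds shown in Figure 4.46.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X = np.arange(0.1, 1.05, 0.1).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

votes = np.zeros(len(y))
for _ in range(10):                               # 10 bagging rounds
    idx = rng.randint(0, len(y), len(y))          # bootstrap sample (with replacement)
    stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
    votes += stump.predict(X)                     # summing ±1 predictions = majority vote

print(np.sign(votes))                             # ensemble predictions (a 0 would indicate a tie)
print("training accuracy:", np.mean(np.sign(votes) == y))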

Figure 4.47. Example of combining classifiers constructed using the bagging approach.

4.10.5 Boosting

Boosting is an iterative procedure used to adaptively change the distribution of training examples for learning base classifiers so that they increasingly focus on examples that are hard to classify. Unlike bagging, boosting assigns a weight to each training example and may adaptively change the weight at the end of each boosting round. The weights assigned to the training examples can be used in the following ways:

1. They can be used to inform the sampling distribution used to draw a set of bootstrap samples from the original data.

2. They can be used to learn a model that is biased toward examples with higher weight.

This section describes an algorithm that uses weights of examples to determine the sampling distribution of its training set. Initially, the examples are assigned equal weights, 1/N, so that they are equally likely to be chosen for training. A sample is drawn according to the sampling distribution of the training examples to obtain a new training set. Next, a classifier is built from the training set and used to classify all the examples in the original data. The weights of the training examples are updated at the end of each boosting round. Examples that are classified incorrectly will have their weights increased, while those that are classified correctly will have their weights decreased. This forces the classifier to focus on examples that are difficult to classify in subsequent iterations.

The following table shows the examples chosen during each boosting round, when applied to the data shown in Table 4.4.

Boosting (Round 1): 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2): 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3): 4 4 8 10 4 5 4 6 3 4

Initially, all the examples are assigned the same weights. However, some examples may be chosen more than once, e.g., examples 3 and 7, because the sampling is done with replacement. A classifier built from the data is then used to classify all the examples. Suppose example 4 is difficult to classify. The weight for this example will be increased in future iterations as it gets misclassified repeatedly. Meanwhile, examples that were not chosen in the previous round, e.g., examples 1 and 5, also have a better chance of being selected in the next round since their predictions in the previous round were likely to be wrong. As the boosting rounds proceed, examples that are the hardest to classify tend to become even more prevalent. The final ensemble is obtained by aggregating the base classifiers obtained from each boosting round.

Over the years, several implementations of the boosting algorithm have been developed. These algorithms differ in terms of (1) how the weights of the training examples are updated at the end of each boosting round, and (2) how the predictions made by each classifier are combined. An implementation called AdaBoost is explored in the next section.

AdaBoost. Let {(x_j, y_j) | j = 1, 2, …, N} denote a set of N training examples. In the AdaBoost algorithm, the importance of a base classifier C_i depends on its error rate, which is defined as

  ε_i = (1/N) [ Σ_{j=1}^{N} w_j I(C_i(x_j) ≠ y_j) ],        (4.102)

where I(p) = 1 if the predicate p is true, and 0 otherwise. The importance of a classifier C_i is given by the following parameter,

  α_i = (1/2) ln( (1 − ε_i) / ε_i ).

Note that α_i has a large positive value if the error rate is close to 0 and a large negative value if the error rate is close to 1, as shown in Figure 4.48.

Figure 4.48. Plot of α as a function of training error ε.

The α_i parameter is also used to update the weight of the training examples. To illustrate, let w_i^(j) denote the weight assigned to example (x_i, y_i) during the jth boosting round. The weight update mechanism for AdaBoost is given by the equation:

  w_i^(j+1) = (w_i^(j) / Z_j) × e^(−α_j)   if C_j(x_i) = y_i,
  w_i^(j+1) = (w_i^(j) / Z_j) × e^(α_j)    if C_j(x_i) ≠ y_i,        (4.103)

where Z_j is the normalization factor used to ensure that Σ_i w_i^(j+1) = 1. The weight update formula given in Equation 4.103 increases the weights of incorrectly classified examples and decreases the weights of those classified correctly.

Instead of using a majority voting scheme, the prediction made by each classifier C_j is weighted according to α_j. This approach allows AdaBoost to penalize models that have poor accuracy, e.g., those generated at the earlier boosting rounds. In addition, if any intermediate rounds produce an error rate higher than 50%, the weights are reverted back to their original uniform values, w_i = 1/N, and the resampling procedure is repeated. The AdaBoost algorithm is summarized in Algorithm 4.6.

Algorithm 4.6 AdaBoost algorithm.
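The algorithm box itself is not reproduced in this extraction. The following is a minimal sketch of the procedure just described (Equations 4.102 and 4.103), not the book's Algorithm 4.6 verbatim; decision stumps from scikit-learn and labels in {−1, +1} are assumptions made for the illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, k=10, seed=0):
    rng = np.random.RandomState(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)                            # initial weights 1/N
    classifiers, alphas = [], []
    attempts = 0
    while len(classifiers) < k and attempts < 100:     # cap attempts to keep the sketch finite
        attempts += 1
        idx = rng.choice(n, size=n, replace=True, p=w) # sample according to the weights
        stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))                  # weighted error rate on the original data
        if eps >= 0.5:                                 # revert to uniform weights and resample
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))   # importance alpha_j
        w = w * np.exp(-alpha * y * pred)              # Eq. 4.103 for ±1 labels
        w = w / w.sum()                                # normalize by Z_j
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    scores = sum(a * c.predict(X) for a, c in zip(alphas, classifiers))
    return np.sign(scores)                             # alpha-weighted vote

X = np.arange(0.1, 1.05, 0.1).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])
clfs, alphas = adaboost(X, y)
print(adaboost_predict(clfs, alphas, X))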

Let us examine how the boosting approach works on the data set shown in Table 4.4. Initially, all the examples have identical weights. After three boosting rounds, the examples chosen for training are shown in Figure 4.49(a). The weights for each example are updated at the end of each boosting round using Equation 4.103, as shown in Figure 4.50(b).

Without boosting, the accuracy of the decision stump is, at best, 70%. With AdaBoost, the results of the predictions are given in Figure 4.50(b). The final prediction of the ensemble classifier is obtained by taking a weighted average of the predictions made by each base classifier, which is shown in the last row of Figure 4.50(b). Notice that AdaBoost perfectly classifies all the examples in the training data.

Figure 4.49. Example of boosting.

An important analytical result of boosting shows that the training error of the ensemble is bounded by the following expression:

  e_ensemble ≤ Π_i [ √(ε_i (1 − ε_i)) ],        (4.104)

where ε_i is the error rate of each base classifier i. If the error rate of the base classifier is less than 50%, we can write ε_i = 0.5 − γ_i, where γ_i measures how much better the classifier is than random guessing. The bound on the training error of the ensemble becomes

  e_ensemble ≤ Π_i √(1 − 4γ_i²) ≤ exp(−2 Σ_i γ_i²).        (4.105)

Hence, the training error of the ensemble decreases exponentially, which leads to the fast convergence of the algorithm. By focusing on examples that are difficult to classify by base classifiers, it is able to reduce the bias of the final predictions along with the variance. AdaBoost has been shown to provide significant improvements in performance over base classifiers on a range of data sets. Nevertheless, because of its tendency to focus on training examples that are wrongly classified, the boosting technique can be susceptible to overfitting, resulting in poor generalization performance in some scenarios.

Figure 4.50. Example of combining classifiers constructed using the AdaBoost approach.

4.10.6 Random Forests

Random forests attempt to improve the generalization performance by constructing an ensemble of decorrelated decision trees. Random forests build on the idea of bagging to use a different bootstrap sample of the training data for learning decision trees. However, a key distinguishing feature of random forests from bagging is that at every internal node of a tree, the best splitting criterion is chosen among a small set of randomly selected attributes. In this way, random forests construct ensembles of decision trees by not only manipulating training instances (by using bootstrap samples similar to bagging), but also the input attributes (by using different subsets of attributes at every internal node).

Given a training set D consisting of n instances and d attributes, the basic procedure of training a random forest classifier can be summarized using the following steps:

1. Construct a bootstrap sample D_i of the training set by randomly sampling n instances (with replacement) from D.

2. Use D_i to learn a decision tree T_i as follows. At every internal node of T_i, randomly sample a set of p attributes and choose an attribute from this subset that shows the maximum reduction in an impurity measure for splitting. Repeat this procedure till every leaf is pure, i.e., containing instances from the same class.

Once an ensemble of decision trees has been constructed, their average prediction (majority vote) on a test instance is used as the final prediction of the random forest. Note that the decision trees involved in a random forest are unpruned trees, as they are allowed to grow to their largest possible size till every leaf is pure. Hence, the base classifiers of random forest represent unstable classifiers that have low bias but high variance, because of their large size.
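The two steps above correspond directly to standard library parameters. The following minimal sketch uses scikit-learn (the data set is an illustrative assumption): each tree sees its own bootstrap sample and examines a random subset of p attributes at every internal node.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,        # number of decision trees T_i
    max_features="sqrt",     # p = sqrt(d) attributes examined at each internal node
    bootstrap=True,          # each tree is trained on its own bootstrap sample D_i
    random_state=0,
).fit(X, y)
print(forest.predict(X[:5]), y[:5])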

Another property of the base classifiers learned in random forests is the lack of correlation among their model parameters and test predictions. This can be attributed to the use of an independently sampled data set D_i for learning every decision tree T_i, similar to the bagging approach. However, random forests have the additional advantage of choosing a splitting criterion at every internal node using a different (and randomly selected) subset of attributes. This property significantly helps in breaking the correlation structure, if any, among the decision trees T_i.

To realize this, consider a training set involving a large number of attributes, where only a small subset of attributes are strong predictors of the target class, whereas other attributes are weak indicators. Given such a training set, even if we consider different bootstrap samples D_i for learning T_i, we would mostly be choosing the same attributes for splitting at internal nodes, because the weak attributes would be largely overlooked when compared with the strong predictors. This can result in a considerable correlation among the trees. However, if we restrict the choice of attributes at every internal node to a random subset of attributes, we can ensure the selection of both strong and weak predictors, thus promoting diversity among the trees. This principle is utilized by random forests for creating decorrelated decision trees.

By aggregating the predictions of an ensemble of strong and decorrelated decision trees, random forests are able to reduce the variance of the trees without negatively impacting their low bias. This makes random forests quite robust to overfitting. Additionally, because of their ability to consider only a small subset of attributes at every internal node, random forests are computationally fast and robust even in high-dimensional settings.

The number of attributes to be selected at every node, p, is a hyper-parameter of the random forest classifier. A small value of p can reduce the correlation among the classifiers but may also reduce their strength. A large value can improve their strength but may result in correlated trees similar to bagging. Although common suggestions for p in the literature include √d and log₂ d + 1, a suitable value of p for a given training set can always be selected by tuning it over a validation set, as described in the previous chapter. However, there is an alternative way for selecting hyper-parameters in random forests, which does not require using a separate validation set. It involves computing a reliable estimate of the generalization error rate directly during training, known as the out-of-bag (oob) error estimate. The oob estimate can be computed for any generic ensemble learning method that builds independent base classifiers using bootstrap samples of the training set, e.g., bagging and random forests. The approach for computing the oob estimate can be described as follows.

Consider an ensemble learning method that uses an independent base classifier T_i built on a bootstrap sample D_i of the training set. Since every training instance will be used for training approximately 63% of the base classifiers, we can call it an out-of-bag sample for the remaining 37% of base classifiers that did not use it for training. If we use these remaining classifiers to make predictions on the instance, we can obtain its oob error by taking their majority vote and comparing it with its class label. Note that the oob error estimates the error of classifiers on an instance that was not used for training those classifiers. Hence, the oob error can be considered a reliable estimate of generalization error. By taking the average of the oob errors of all training instances, we can compute the overall oob error estimate. This can be used as an alternative to the validation error rate for selecting hyper-parameters. Hence, random forests do not need to use a separate partition of the training set for validation, as they can simultaneously train the base classifiers and compute generalization error estimates on the same data set.
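A minimal sketch of using the oob estimate to choose p follows (data set and candidate values are illustrative assumptions); scikit-learn computes the oob accuracy when oob_score=True, so no separate validation partition is needed.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
for p in (2, 4, "sqrt", None):                        # candidate values of the hyper-parameter p
    rf = RandomForestClassifier(n_estimators=200, max_features=p,
                                oob_score=True, random_state=0).fit(X, y)
    print(p, "oob accuracy:", round(rf.oob_score_, 3))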

Random forests have been empirically found to provide significant improvements in generalization performance that are often comparable, if not superior, to the improvements provided by the AdaBoost algorithm. Random forests are also more robust to overfitting and run much faster than the AdaBoost algorithm.

4.10.7 Empirical Comparison among Ensemble Methods

Table 4.5 shows the empirical results obtained when comparing the performance of a decision tree classifier against bagging, boosting, and random forest. The base classifiers used in each ensemble method consist of 50 decision trees. The classification accuracies reported in this table are obtained from tenfold cross-validation. Notice that the ensemble classifiers generally outperform a single decision tree classifier on many of the data sets.

Table 4.5. Comparing the accuracy of a decision tree classifier against three ensemble methods.

Data Set   (Attributes, Classes, Instances)   Decision Tree (%)   Bagging (%)   Boosting (%)   RF (%)

Anneal (39,6,898) 92.09 94.43 95.43 95.43

Australia (15,2,690) 85.51 87.10 85.22 85.80

Auto (26,7,205) 81.95 85.37 85.37 84.39

Breast (11,2,699) 95.14 96.42 97.28 96.14

Cleve (14,2,303) 76.24 81.52 82.18 82.18

Credit (16,2,690) 85.8 86.23 86.09 85.8

Diabetes (9,2,768) 72.40 76.30 73.18 75.13

German (21,2,1000) 70.90 73.40 73.00 74.5

Glass (10,7,214) 67.29 76.17 77.57 78.04

Heart (14,2,270) 80.00 81.48 80.74 83.33

Hepatitis (20,2,155) 81.94 81.29 83.87 83.23

Horse (23,2,368) 85.33 85.87 81.25 85.33

Ionosphere (35,2,351) 89.17 92.02 93.73 93.45

Iris (5,3,150) 94.67 94.67 94.00 93.33

Labor (17,2,57) 78.95 84.21 89.47 84.21

Led7 (8,10,3200) 73.34 73.66 73.34 73.06

Lymphography (19,4,148) 77.03 79.05 85.14 82.43

Pima (9,2,768) 74.35 76.69 73.44 77.60

Sonar (61,2,208) 78.85 78.85 84.62 85.58

Tic-tac-toe (10,2,958) 83.72 93.84 98.54 95.82

Vehicle (19,4,846) 71.04 74.11 78.25 74.94

Waveform (22,3,5000) 76.44 83.30 83.90 84.04

Wine (14,3,178) 94.38 96.07 97.75 97.75

Zoo (17,7,101) 93.07 93.07 95.05 97.03

4.11 Class Imbalance Problem

In many data sets there are a disproportionate number of instances that belong to different classes, a property known as skew or class imbalance. For example, consider a health-care application where diagnostic reports are used to decide whether a person has a rare disease. Because of the infrequent nature of the disease, we can expect to observe a smaller number of subjects who are positively diagnosed. Similarly, in credit card fraud detection, fraudulent transactions are greatly outnumbered by legitimate transactions.

The degree of imbalance between the classes varies across different applications and even across different data sets from the same application. For example, the risk for a rare disease may vary across different populations of subjects depending on their dietary and lifestyle choices. However, despite their infrequent occurrences, a correct classification of the rare class often has greater value than a correct classification of the majority class. For example, it may be more dangerous to ignore a patient suffering from a disease than to misdiagnose a healthy person.

More generally, class imbalance poses two challenges for classification. First, it can be difficult to find sufficiently many labeled samples of a rare class. Note that many of the classification methods discussed so far work well only when the training set has a balanced representation of both classes. Although some classifiers are more effective at handling imbalance in the training data than others, e.g., rule-based classifiers and k-NN, they are all impacted if the minority class is not well-represented in the training set. In general, a classifier trained over an imbalanced data set shows a bias toward improving its performance over the majority class, which is often not the desired behavior. As a result, many existing classification models, when trained on an imbalanced data set, may not effectively detect instances of the rare class.

Second, accuracy, which is the traditional measure for evaluating classification performance, is not well-suited for evaluating models in the presence of class imbalance in the test data. For example, if 1% of the credit card transactions are fraudulent, then a trivial model that predicts every transaction as legitimate will have an accuracy of 99% even though it fails to detect any of the fraudulent activities. Thus, there is a need to use alternative evaluation metrics that are sensitive to the skew and can capture different criteria of performance than accuracy.

In this section, we first present some of the generic methods for building classifiers when there is class imbalance in the training set. We then discuss methods for evaluating classification performance and adapting classification decisions in the presence of a skewed test set. In the remainder of this section, we will consider binary classification problems for simplicity, where the minority class is referred to as the positive (+) class while the majority class is referred to as the negative (−) class.

4.11.1 Building Classifiers with Class Imbalance

There are two primary considerations for building classifiers in the presence of class imbalance in the training set. First, we need to ensure that the learning algorithm is trained over a data set that has adequate representation of both the majority as well as the minority classes. Some common approaches for ensuring this include the methodologies of oversampling and undersampling the training set. Second, having learned a classification model, we need a way to adapt its classification decisions (and thus create an appropriately tuned classifier) to best match the requirements of the imbalanced test set. This is typically done by converting the outputs of the classification model to real-valued scores, and then selecting a suitable threshold on the classification score to match the needs of a test set. Both these considerations are discussed in detail in the following.

Oversampling and Undersampling. The first step in learning with imbalanced data is to transform the training set to a balanced training set, where both classes have nearly equal representation. The balanced training set can then be used with any of the existing classification techniques (without making any modifications in the learning algorithm) to learn a model that gives equal emphasis to both classes. In the following, we present some of the common techniques for transforming an imbalanced training set to a balanced one.

A basic approach for creating balanced training sets is to generate a sample of training instances where the rare class has adequate representation. There are two types of sampling methods that can be used to enhance the representation of the minority class: (a) undersampling, where the frequency of the majority class is reduced to match the frequency of the minority class, and (b) oversampling, where artificial examples of the minority class are created to make them equal in proportion to the number of negative instances.

To illustrate undersampling, consider a training set that contains 100 positive examples and 1000 negative examples. To overcome the skew among the classes, we can select a random sample of 100 examples from the negative class and use them with the 100 positive examples to create a balanced training set. A classifier built over the resultant balanced set will then be unbiased toward both classes. However, one limitation of undersampling is that some of the useful negative examples (e.g., those closer to the actual decision boundary) may not be chosen for training, therefore resulting in an inferior classification model. Another limitation is that the smaller sample of 100 negative instances may have a higher variance than the larger set of 1000.

Oversampling attempts to create a balanced training set by artificially generating new positive examples. A simple approach for oversampling is to duplicate every positive instance n−/n+ times, where n+ and n− are the numbers of positive and negative training instances, respectively. Figure 4.51 illustrates the effect of oversampling on the learning of a decision boundary using a classifier such as a decision tree. Without oversampling, only the positive examples at the bottom right-hand side of Figure 4.51(a) are classified correctly. The positive example in the middle of the diagram is misclassified because there are not enough examples to justify the creation of a new decision boundary to separate the positive and negative instances. Oversampling provides the additional examples needed to ensure that the decision boundary surrounding the positive example is not pruned, as illustrated in Figure 4.51(b). Note that duplicating a positive instance is analogous to doubling its weight during the training stage. Hence, the effect of oversampling can be alternatively achieved by assigning higher weights to positive instances than negative instances. This method of weighting instances can be used with a number of classifiers such as logistic regression, ANN, and SVM.

Figure 4.51. Illustrating the effect of oversampling of the rare class.

One limitation of the duplication method for oversampling is that the replicated positive examples have an artificially lower variance when compared with their true distribution in the overall data. This can bias the classifier to the specific distribution of training instances, which may not be representative of the overall distribution of test instances, leading to poor generalizability. To overcome this limitation, an alternative approach for oversampling is to generate synthetic positive instances in the neighborhood of existing positive instances. In this approach, called the Synthetic Minority Oversampling Technique (SMOTE), we first determine the k-nearest positive neighbors of every positive instance x, and then generate a synthetic positive instance at some intermediate point along the line segment joining x to one of its randomly chosen k-nearest neighbors, x_k. This process is repeated until the desired number of positive instances is reached. However, one limitation of this approach is that it can only generate new positive instances in the convex hull of the existing positive class. Hence, it does not help improve the representation of the positive class outside the boundary of existing positive instances. Despite their complementary strengths and weaknesses, undersampling and oversampling provide useful directions for generating balanced training sets in the presence of class imbalance.
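The following is a minimal sketch of the SMOTE idea just described, not the reference implementation; the positive instances are randomly generated for illustration.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_pos, n_synthetic, k=5, seed=0):
    rng = np.random.RandomState(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_pos)   # +1 because x is its own nearest neighbor
    _, idx = nn.kneighbors(X_pos)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randint(len(X_pos))                        # pick a positive instance x
        j = idx[i, rng.randint(1, k + 1)]                  # one of its k nearest positive neighbors
        lam = rng.rand()                                   # intermediate point on the line segment
        synthetic.append(X_pos[i] + lam * (X_pos[j] - X_pos[i]))
    return np.array(synthetic)

X_pos = np.random.RandomState(1).normal(size=(20, 2))      # 20 positive instances (illustrative)
print(smote(X_pos, n_synthetic=5).shape)                   # (5, 2)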

Assigning Scores to Test Instances. If a classifier returns an ordinal score s(x) for every test instance x such that a higher score denotes a greater likelihood of x belonging to the positive class, then for every possible value of score threshold, s_T, we can create a new binary classifier where a test instance x is classified positive only if s(x) > s_T. Thus, every choice of s_T can potentially lead to a different classifier, and we are interested in finding the classifier that is best suited for our needs.

Ideally, we would like the classification score to vary monotonically with the actual posterior probability of the positive class, i.e., if s(x1) and s(x2) are the scores of any two instances, x1 and x2, then s(x1) ≥ s(x2) ⇒ P(y = 1 | x1) ≥ P(y = 1 | x2). However, this is difficult to guarantee in practice as the properties of the classification score depend on several factors such as the complexity of the classification algorithm and the representative power of the training set. In general, we can only expect the classification score of a reasonable algorithm to be weakly related to the actual posterior probability of the positive class, even though the relationship may not be strictly monotonic. Most classifiers can be easily modified to produce such a real-valued score. For example, the signed distance of an instance from the positive margin hyperplane of SVM can be used as a classification score. As another example, test instances belonging to a leaf in a decision tree can be assigned a score based on the fraction of training instances labeled as positive in the leaf. Also, probabilistic classifiers such as naïve Bayes, Bayesian networks, and logistic regression naturally output estimates of posterior probabilities, P(y = 1 | x). Next, we discuss some evaluation measures for assessing the goodness of a classifier in the presence of class imbalance.

Table 4.6. A confusion matrix for a binary classification problem in which the classes are not equally important.

                    Predicted +     Predicted −
Actual class +      f++ (TP)        f+− (FN)
Actual class −      f−+ (FP)        f−− (TN)

4.11.2 Evaluating Performance with Class Imbalance

The most basic approach for representing a classifier's performance on a test set is to use a confusion matrix, as shown in Table 4.6. This table is essentially the same as Table 3.4, which was introduced in the context of evaluating classification performance in Section 3.2. A confusion matrix summarizes the number of instances predicted correctly or incorrectly by a classifier using the following four counts:

True positive (TP) or f++, which corresponds to the number of positive examples correctly predicted by the classifier.

False positive (FP) or f−+ (also known as Type I error), which corresponds to the number of negative examples wrongly predicted as positive by the classifier.

False negative (FN) or f+− (also known as Type II error), which corresponds to the number of positive examples wrongly predicted as negative by the classifier.

True negative (TN) or f−−, which corresponds to the number of negative examples correctly predicted by the classifier.

The confusion matrix provides a concise representation of classification performance on a given test data set. However, it is often difficult to interpret and compare the performance of classifiers using the four-dimensional representations (corresponding to the four counts) provided by their confusion matrices. Hence, the counts in the confusion matrix are often summarized using a number of evaluation measures. Accuracy is an example of one such measure that combines these four counts into a single value, which is used extensively when classes are balanced. However, the accuracy measure is not suitable for handling data sets with imbalanced class distributions as it tends to favor classifiers that correctly classify the majority class. In the following, we describe other possible measures that capture different criteria of performance when working with imbalanced classes.

A basic evaluation measure is the true positive rate (TPR), which is defined as the fraction of positive test instances correctly predicted by the classifier:

  TPR = TP / (TP + FN).

In the medical community, TPR is also known as sensitivity, while in the information retrieval literature, it is also called recall (r). A classifier with a high TPR has a high chance of correctly identifying the positive instances of the data.

Analogous to TPR, the true negative rate (TNR) (also known as specificity) is defined as the fraction of negative test instances correctly predicted by the classifier, i.e.,

  TNR = TN / (FP + TN).

A high TNR value signifies that the classifier correctly classifies any randomly chosen negative instance in the test set. A commonly used evaluation measure that is closely related to TNR is the false positive rate (FPR), which is defined as 1 − TNR:

  FPR = FP / (FP + TN).

Similarly, we can define the false negative rate (FNR) as 1 − TPR:

  FNR = FN / (FN + TP).

Note that the evaluation measures defined above do not take into account the skew among the classes, which can be formally defined as α = P/(P + N), where P and N denote the number of actual positives and actual negatives, respectively. As a result, changing the relative numbers of P and N will have no effect on TPR, TNR, FPR, or FNR, since they depend only on the fraction of correct classifications for every class, independently of the other class. Furthermore, knowing the values of TPR and TNR (and consequently FNR and FPR) does not by itself help us uniquely determine all four entries of the confusion matrix. However, together with information about the skew factor, α, and the total number of instances, N, we can compute the entire confusion matrix using TPR and TNR, as shown in Table 4.7.

Table 4.7. Entries of the confusion matrix in terms of the TPR, TNR, skew, α, and total number of instances, N.

             Predicted +               Predicted −
Actual +     TPR × α × N               (1 − TPR) × α × N       α × N
Actual −     (1 − TNR) × (1 − α) × N   TNR × (1 − α) × N       (1 − α) × N
                                                               N

An evaluation measure that is sensitive to the skew is precision, which can be defined as the fraction of correct predictions of the positive class over the total number of positive predictions, i.e.,

  Precision, p = TP / (TP + FP).

Precision is also referred to as the positive predicted value (PPV). A classifier that has a high precision is likely to have most of its positive predictions correct. Precision is a useful measure for highly skewed test sets where the positive predictions, even though small in numbers, are required to be mostly correct. A measure that is closely related to precision is the false discovery rate (FDR), which is equal to 1 − p and can be defined as

  FDR = FP / (TP + FP).

Although both FDR and FPR focus on FP, they are designed to capture different evaluation objectives and thus can take quite contrasting values, especially in the presence of class imbalance. To illustrate this, consider a classifier with the following confusion matrix.

             Predicted +   Predicted −
Actual +     100           0
Actual −     100           900

Since half of the positive predictions made by the classifier are incorrect, it has a FDR value of 100/(100 + 100) = 0.5. However, its FPR is equal to 100/(100 + 900) = 0.1, which is quite low. This example shows that in the presence of high skew (i.e., a very small value of α), even a small FPR can result in a high FDR. See Section 10.6 for further discussion of this issue.

Note that the evaluation measures defined above provide an incomplete representation of performance, because they either only capture the effect of false positives (e.g., FPR and precision) or the effect of false negatives (e.g., TPR or recall), but not both. Hence, if we optimize only one of these evaluation measures, we may end up with a classifier that shows low FN but high FP, or vice-versa. For example, a classifier that declares every instance to be positive will have a perfect recall, but high FPR and very poor precision. On the other hand, a classifier that is very conservative in classifying an instance as positive (to reduce FP) may end up having high precision but very poor recall. We thus need evaluation measures that account for both types of misclassifications, FP and FN. Some examples of such evaluation measures are summarized by the following definitions:

  Positive Likelihood Ratio = TPR / FPR.
  F1 measure = 2rp / (r + p) = 2 × TP / (2 × TP + FP + FN).
  G measure = √(rp) = TP / √((TP + FP)(TP + FN)).

While some of these evaluation measures are invariant to the skew (e.g., the positive likelihood ratio), others (e.g., precision and the F1 measure) are sensitive to skew. Further, different evaluation measures capture the effects of different types of misclassification errors in various ways. For example, the F1 measure represents a harmonic mean between recall and precision, i.e.,

  F1 = 2 / (1/r + 1/p).

Because the harmonic mean of two numbers tends to be closer to the smaller of the two numbers, a high value of F1-measure ensures that both precision and recall are reasonably high. Similarly, the G measure represents the geometric mean between recall and precision. A comparison among harmonic, geometric, and arithmetic means is given in the next example.

Example 4.9. Consider two positive numbers a = 1 and b = 5. Their arithmetic mean is μ_a = (a + b)/2 = 3 and their geometric mean is μ_g = √(ab) = 2.236. Their harmonic mean is μ_h = (2 × 1 × 5)/6 = 1.667, which is closer to the smaller value between a and b than the arithmetic and geometric means.

A generic extension of the F1 measure is the F_β measure, which can be defined as follows.

  F_β = (β² + 1)rp / (r + β²p) = (β² + 1) × TP / ((β² + 1)TP + β²FN + FP).        (4.106)

Both precision and recall can be viewed as special cases of F_β by setting β = 0 and β = ∞, respectively. Low values of β make F_β closer to precision, and high values make it closer to recall.

A more general measure that captures F_β as well as accuracy is the weighted accuracy measure, which is defined by the following equation:

  Weighted accuracy = (w1 TP + w4 TN) / (w1 TP + w2 FN + w3 FP + w4 TN).        (4.107)

The relationship between weighted accuracy and other performance measures is summarized in the following table:

  Measure      w1        w2    w3   w4
  Recall       1         1     0    0
  Precision    1         0     1    0
  F_β          β² + 1    β²    1    0
  Accuracy     1         1     1    1
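A minimal sketch that computes several of the measures defined above directly from the four counts of a confusion matrix, using the example matrix given earlier (TP = 100, FN = 0, FP = 100, TN = 900), follows.

def metrics(TP, FN, FP, TN):
    recall = TP / (TP + FN)                   # TPR / sensitivity
    precision = TP / (TP + FP)
    return {
        "TPR (recall)": recall,
        "TNR (specificity)": TN / (FP + TN),
        "FPR": FP / (FP + TN),
        "FDR": FP / (TP + FP),
        "precision": precision,
        "F1": 2 * recall * precision / (recall + precision),
        "G": (recall * precision) ** 0.5,
    }

for name, value in metrics(TP=100, FN=0, FP=100, TN=900).items():
    print(f"{name:18s} {value:.3f}")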

4.11.3 Finding an Optimal Score Threshold

Given a suitably chosen evaluation measure E and a distribution of classification scores, s(x), on a validation set, we can obtain the optimal score threshold s* on the validation set using the following approach:

1. Sort the scores in increasing order of their values.

2. For every unique value of score, s, consider the classification model that assigns an instance x as positive only if s(x) > s. Let E(s) denote the performance of this model on the validation set.

3. Find s* that maximizes the evaluation measure E(s):

  s* = argmax_s E(s).

Note that s* can be treated as a hyper-parameter of the classification algorithm that is learned during model selection. Using s*, we can assign a positive label to a future test instance x only if s(x) > s*. If the evaluation measure E is skew invariant (e.g., the positive likelihood ratio), then we can select s* without knowing the skew of the test set, and the resultant classifier formed using s* can be expected to show optimal performance on the test set (with respect to the evaluation measure E). On the other hand, if E is sensitive to the skew (e.g., precision or the F1-measure), then we need to ensure that the skew of the validation set used for selecting s* is similar to that of the test set, so that the classifier formed using s* shows optimal test performance with respect to E. Alternatively, given an estimate of the skew of the test data, α, we can use it along with the TPR and TNR on the validation set to estimate all entries of the confusion matrix (see Table 4.7), and thus the estimate of any evaluation measure E on the test set. The score threshold s* selected using this estimate of E can then be expected to produce optimal test performance with respect to E. Furthermore, the methodology of selecting s* on the validation set can help in comparing the test performance of different classification algorithms, by using the optimal values of s* for each algorithm.
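A minimal sketch of the three-step threshold-selection procedure, using F1 as the evaluation measure E and hypothetical validation scores and labels, is shown below.

import numpy as np
from sklearn.metrics import f1_score

def optimal_threshold(scores, labels):
    best_s, best_e = None, -1.0
    for s in np.sort(np.unique(scores)):              # Step 1: candidate thresholds in increasing order
        preds = (scores > s).astype(int)               # Step 2: classify as positive only if s(x) > s
        e = f1_score(labels, preds, zero_division=0)   # E(s) on the validation set
        if e > best_e:
            best_s, best_e = s, e
    return best_s, best_e                              # Step 3: s* = argmax_s E(s)

scores = np.array([0.95, 0.87, 0.85, 0.76, 0.61, 0.53, 0.44, 0.35, 0.25, 0.10])
labels = np.array([1,    1,    0,    1,    1,    0,    0,    1,    0,    0])
print(optimal_threshold(scores, labels))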

4.11.4 Aggregate Evaluation of Performance

Although the above approach helps in finding a score threshold s* that provides optimal performance with respect to a desired evaluation measure and a certain amount of skew, α, sometimes we are interested in evaluating the performance of a classifier on a number of possible score thresholds, each corresponding to a different choice of evaluation measure and skew value. Assessing the performance of a classifier over a range of score thresholds is called aggregate evaluation of performance. In this style of analysis, the emphasis is not on evaluating the performance of a single classifier corresponding to the optimal score threshold, but to assess the general quality of ranking produced by the classification scores on the test set. In general, this helps in obtaining robust estimates of classification performance that are not sensitive to specific choices of score thresholds.

ROC Curve. One of the widely-used tools for aggregate evaluation is the receiver operating characteristic (ROC) curve. An ROC curve is a graphical approach for displaying the trade-off between TPR and FPR of a classifier, over varying score thresholds. In an ROC curve, the TPR is plotted along the y-axis and the FPR is shown on the x-axis. Each point along the curve corresponds to a classification model generated by placing a threshold on the test scores produced by the classifier. The following procedure describes the generic approach for computing an ROC curve:

1. Sort the test instances in increasing order of their scores.

2. Select the lowest ranked test instance (i.e., the instance with lowest score). Assign the selected instance and those ranked above it to the positive class. This approach is equivalent to classifying all the test instances as positive class. Because all the positive examples are classified correctly and the negative examples are misclassified, TPR = FPR = 1.

3. Select the next test instance from the sorted list. Classify the selected instance and those ranked above it as positive, while those ranked below it as negative. Update the counts of TP and FP by examining the actual class label of the selected instance. If this instance belongs to the positive class, the TP count is decremented and the FP count remains the same as before. If the instance belongs to the negative class, the FP count is decremented and the TP count remains the same as before.

4. Repeat Step 3 and update the TP and FP counts accordingly until the highest ranked test instance is selected. At this final threshold, TPR = FPR = 0, as all instances are labeled as negative.

5. Plot the TPR against FPR of the classifier.

Example 4.10. [Generating ROC Curve] Figure 4.52 shows an example of how to compute the TPR and FPR values for every choice of score threshold. There are five positive examples and five negative examples in the test set. The class labels of the test instances are shown in the first row of the table, while the second row corresponds to the sorted score values for each instance. The next six rows contain the counts of TP, FP, TN, and FN, along with their corresponding TPR and FPR. The table is then filled from left to right. Initially, all the instances are predicted to be positive. Thus, TP = FP = 5 and TPR = FPR = 1. Next, we assign the test instance with the lowest score as the negative class. Because the selected instance is actually a positive example, the TP count decreases from 5 to 4 and the FP count is the same as before. The FPR and TPR are updated accordingly. This process is repeated until we reach the end of the list, where TPR = 0 and FPR = 0. The ROC curve for this example is shown in Figure 4.53.

Figure 4.52. Computing the TPR and FPR at every score threshold.

Figure 4.53. ROC curve for the data shown in Figure 4.52.
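A minimal sketch of the construction in Steps 1 to 5 follows. The scores and labels here are hypothetical, since the actual values in Figure 4.52 are not reproduced in the text.

import numpy as np

def roc_points(scores, labels):
    order = np.argsort(scores)                    # Step 1: increasing order of score
    labels = np.asarray(labels)[order]
    P, N = labels.sum(), len(labels) - labels.sum()
    points = [(1.0, 1.0)]                         # Step 2: everything predicted positive
    tp, fp = P, N
    for lab in labels:                            # Steps 3-4: raise the threshold one instance at a time
        if lab == 1:
            tp -= 1                               # a positive instance becomes a false negative
        else:
            fp -= 1                               # a negative instance becomes a true negative
        points.append((fp / N, tp / P))           # (FPR, TPR) at this threshold
    return points

scores = [0.95, 0.87, 0.85, 0.76, 0.61, 0.53, 0.44, 0.35, 0.25, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")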

Note that in an ROC curve, the TPR monotonically increases with FPR, because the inclusion of a test instance in the set of predicted positives can either increase the TPR or the FPR. The ROC curve thus has a staircase pattern. Furthermore, there are several critical points along an ROC curve that have well-known interpretations:

(TPR = 0, FPR = 0): Model predicts every instance to be a negative class.
(TPR = 1, FPR = 1): Model predicts every instance to be a positive class.
(TPR = 1, FPR = 0): The perfect model with zero misclassifications.

A good classification model should be located as close as possible to the upper left corner of the diagram, while a model that makes random guesses should reside along the main diagonal, connecting the points (TPR = 0, FPR = 0) and (TPR = 1, FPR = 1). Random guessing means that an instance is classified as a positive class with a fixed probability p, irrespective of its attribute set.

For example, consider a data set that contains n+ positive instances and n− negative instances. The random classifier is expected to correctly classify p·n+ of the positive instances and to misclassify p·n− of the negative instances. Therefore, the TPR of the classifier is (p·n+)/n+ = p, while its FPR is (p·n−)/n− = p. Hence, this random classifier will reside at the point (p, p) in the ROC curve along the main diagonal.

Figure 4.54. ROC curves for two different classifiers.

Since every point on the ROC curve represents the performance of a classifier generated using a particular score threshold, they can be viewed as different operating points of the classifier. One may choose one of these operating points depending on the requirements of the application. Hence, an ROC curve facilitates the comparison of classifiers over a range of operating points. For example, Figure 4.54 compares the ROC curves of two classifiers, M1 and M2, generated by varying the score thresholds. We can see that M1 is better than M2 when FPR is less than 0.36, as M1 shows better TPR than M2 for this range of operating points. On the other hand, M2 is superior when FPR is greater than 0.36, since the TPR of M2 is higher than that of M1 for this range. Clearly, neither of the two classifiers dominates (is strictly better than) the other, i.e., shows higher values of TPR and lower values of FPR over all operating points.

To summarize the aggregate behavior across all operating points, one of the commonly used measures is the area under the ROC curve (AUC). If the classifier is perfect, then its area under the ROC curve will be equal to 1. If the algorithm simply performs random guessing, then its area under the ROC curve will be equal to 0.5.

Although the AUC provides a useful summary of aggregate performance, there are certain caveats in using the AUC for comparing classifiers. First, even if the AUC of algorithm A is higher than the AUC of another algorithm B, this does not mean that algorithm A is always better than B, i.e., that the ROC curve of A dominates that of B across all operating points. For example, even though M1 shows a slightly lower AUC than M2 in Figure 4.54, we can see that both M1 and M2 are useful over different ranges of operating points and neither of them is strictly better than the other across all possible operating points. Hence, we cannot use the AUC to determine which algorithm is better, unless we know that the ROC curve of one of the algorithms dominates the other.

Second, although the AUC summarizes the aggregate performance over all operating points, we are often interested in only a small range of operating points in most applications. For example, even though M1 shows slightly lower AUC than M2, it shows higher TPR values than M2 for small FPR values (smaller than 0.36). In the presence of class imbalance, the behavior of an algorithm over small FPR values (also termed as early retrieval) is often more meaningful for comparison than the performance over all FPR values. This is because, in many applications, it is important to assess the TPR achieved by a classifier in the first few instances with highest scores, without incurring a large FPR. Hence, in Figure 4.54, due to the high TPR values of M1 during early retrieval (FPR < 0.36), we may prefer M1 over M2 for imbalanced test sets, despite the lower AUC of M1. Hence, care must be taken while comparing the AUC values of different classifiers, usually by visualizing their ROC curves rather than just reporting their AUC.

A key characteristic of ROC curves is that they are agnostic to the skew in the test set, because both the evaluation measures used in constructing ROC curves (TPR and FPR) are invariant to class imbalance. Hence, ROC curves are not suitable for measuring the impact of skew on classification performance. In particular, we will obtain the same ROC curve for two test data sets that have very different skew.

Figure 4.55. PR curves for two different classifiers.

Precision-Recall Curve. An alternate tool for aggregate evaluation is the precision-recall curve (PR curve). The PR curve plots the precision and recall values of a classifier on the y and x axes respectively, by varying the threshold on the test scores. Figure 4.55 shows an example of PR curves for two hypothetical classifiers, M1 and M2. The approach for generating a PR curve is similar to the approach described above for generating an ROC curve. However, there are some key distinguishing features in the PR curve:

1. PR curves are sensitive to the skew factor α = P/(P + N), and different PR curves are generated for different values of α.

2. When the score threshold is lowest (every instance is labeled as positive), the precision is equal to α while recall is 1. As we increase the score threshold, the number of predicted positives can stay the same or decrease. Hence, the recall monotonically declines as the score threshold increases. In general, the precision may increase or decrease for the same value of recall, upon addition of an instance into the set of predicted positives. For example, if the kth ranked instance belongs to the negative class, then including it will result in a drop in the precision without affecting the recall. The precision may improve at the next step, which adds the (k+1)th ranked instance, if this instance belongs to the positive class. Hence, the PR curve is not a smooth, monotonically increasing curve like the ROC curve, and generally has a zigzag pattern. This pattern is more prominent in the left part of the curve, where even a small change in the number of false positives can cause a large change in precision.

3. Also, as we increase the imbalance among the classes (reduce the value of α), the rightmost points of all PR curves will move downwards. At and near the leftmost point on the PR curve (corresponding to larger values of score threshold), the recall is close to zero, while the precision is equal to the fraction of positives in the top ranked instances of the algorithm. Hence, different classifiers can have different values of precision at the leftmost points of the PR curve. Also, if the classification score of an algorithm monotonically varies with the posterior probability of the positive class, we can expect the PR curve to gradually decrease from a high value of precision on the leftmost point to a constant value of α at the rightmost point, albeit with some ups and downs. This can be observed in the PR curve of algorithm M1 in Figure 4.55, which starts from a higher value of precision on the left that gradually decreases as we move towards the right. On the other hand, the PR curve of algorithm M2 starts from a lower value of precision on the left and shows more drastic ups and downs as we move right, suggesting that the classification score of M2 shows a weaker monotonic relationship with the posterior probability of the positive class.

4. A random classifier that assigns an instance to be positive with a fixed probability p has a precision of α and a recall of p. Hence, a classifier that performs random guessing has a horizontal PR curve with y = α, as shown using a dashed line in Figure 4.55. Note that the random baseline in PR curves depends on the skew in the test set, in contrast to the fixed main diagonal of ROC curves that represents random classifiers.

5. Note that the precision of an algorithm is impacted more strongly by false positives in the top ranked test instances than the FPR of the algorithm. For this reason, the PR curve generally helps to magnify the differences between classifiers in the left portion of the PR curve. Hence, in the presence of class imbalance in the test data, analyzing the PR curves generally provides a deeper insight into the performance of classifiers than the ROC curves, especially in the early retrieval range of operating points.

6. The classifier corresponding to (precision = 1, recall = 1) represents the perfect classifier. Similar to AUC, we can also compute the area under the PR curve of an algorithm, known as AUC-PR. The AUC-PR of a random classifier is equal to α, while that of a perfect algorithm is equal to 1. Note that AUC-PR varies with changing skew in the test set, in contrast to the area under the ROC curve, which is insensitive to the skew. The AUC-PR helps in accentuating the differences between classification algorithms in the early retrieval range of operating points. Hence, it is more suited for evaluating classification performance in the presence of class imbalance than the area under the ROC curve. However, similar to ROC curves, a higher value of AUC-PR does not guarantee the superiority of a classification algorithm over another. This is because the PR curves of two algorithms can easily cross each other, such that they both show better performances in different ranges of operating points. Hence, it is important to visualize the PR curves before comparing their AUC-PR values, in order to ensure a meaningful evaluation.
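A minimal sketch of the corresponding precision-recall computation, on the same kind of scored test set used in the ROC sketch above (scores and labels are hypothetical), is shown below; note how precision depends on the false positives among the top-ranked instances.

import numpy as np

def pr_points(scores, labels):
    order = np.argsort(scores)[::-1]              # decreasing score: top-ranked first
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                        # positives among the top-k predictions
    k = np.arange(1, len(labels) + 1)
    precision = tp / k
    recall = tp / labels.sum()
    return list(zip(recall, precision))

scores = [0.95, 0.87, 0.85, 0.76, 0.61, 0.53, 0.44, 0.35, 0.25, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
for r, p in pr_points(scores, labels):
    print(f"recall={r:.1f}  precision={p:.2f}")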

4.12 Multiclass Problem

Some of the classification techniques described in this chapter are originally designed for binary classification problems. Yet there are many real-world problems, such as character recognition, face identification, and text classification, where the input data is divided into more than two categories. This section presents several approaches for extending the binary classifiers to handle multiclass problems. To illustrate these approaches, let Y = {y1, y2, …, yK} be the set of classes of the input data.

The first approach decomposes the multiclass problem into K binary problems. For each class yi ∈ Y, a binary problem is created where all instances that belong to yi are considered positive examples, while the remaining instances are considered negative examples. A binary classifier is then constructed to separate instances of class yi from the rest of the classes. This is known as the one-against-rest (1-r) approach.

The second approach, which is known as the one-against-one (1-1) approach, constructs K(K − 1)/2 binary classifiers, where each classifier is used to distinguish between a pair of classes, (yi, yj). Instances that do not belong to either yi or yj are ignored when constructing the binary classifier for (yi, yj). In both 1-r and 1-1 approaches, a test instance is classified by combining the predictions made by the binary classifiers. A voting scheme is typically employed to combine the predictions, where the class that receives the highest number of votes is assigned to the test instance. In the 1-r approach, if an instance is classified as negative, then all classes except for the positive class receive a vote. This approach, however, may lead to ties among the different classes. Another possibility is to transform the outputs of the binary classifiers into probability estimates and then assign the test instance to the class that has the highest probability.
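A minimal sketch of the two decompositions using scikit-learn's wrappers follows (the data set and base classifier are illustrative assumptions): the 1-r scheme builds K binary classifiers, while the 1-1 scheme builds K(K − 1)/2, and both combine their outputs by voting.

from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)                 # K = 10 classes
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovr.estimators_), "binary classifiers for 1-r")     # 10
print(len(ovo.estimators_), "binary classifiers for 1-1")     # 45 = 10*9/2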

Example 4.11. Consider a multiclass problem where Y = {y1, y2, y3, y4}. Suppose a test instance is classified as (+, −, −, −) according to the 1-r approach. In other words, it is classified as positive when y1 is used as the positive class and negative when y2, y3, and y4 are used as the positive class. Using a simple majority vote, notice that y1 receives the highest number of votes, which is four, while the remaining classes receive only two votes each. The test instance is therefore classified as y1.

Example 4.12. Suppose the test instance is classified using the 1-1 approach as follows:

Binary pair of classes:   +: y1   +: y1   +: y1   +: y2   +: y2   +: y3
                          −: y2   −: y3   −: y4   −: y3   −: y4   −: y4
Classification:             +       +       −       +       −       +

The first two rows in this table correspond to the pair of classes (yi, yj) chosen to build the classifier and the last row represents the predicted class for the test instance. After combining the predictions, y1 and y4 each receive two votes, while y2 and y3 each receives only one vote. The test instance is therefore classified as either y1 or y4, depending on the tie-breaking procedure.

Error-Correcting Output Coding

A potential problem with the previous two approaches is that they may be sensitive to binary classification errors. For the 1-r approach given in Example 4.11, if at least one of the binary classifiers makes a mistake in its prediction, then the classifier may end up declaring a tie between classes or making a wrong prediction. For example, suppose the test instance is classified as (+, −, +, −) due to misclassification by the third classifier. In this case, it will be difficult to tell whether the instance should be classified as y1 or y3, unless the probability associated with each class prediction is taken into account.

The error-correcting output coding (ECOC) method provides a more robust way for handling multiclass problems. The method is inspired by an information-theoretic approach for sending messages across noisy channels.

The idea behind this approach is to add redundancy into the transmitted message by means of a codeword, so that the receiver may detect errors in the received message and perhaps recover the original message if the number of errors is small.

For multiclass learning, each class yi is represented by a unique bit string of length n known as its codeword. We then train n binary classifiers to predict each bit of the codeword string. The predicted class of a test instance is given by the codeword whose Hamming distance is closest to the codeword produced by the binary classifiers. Recall that the Hamming distance between a pair of bit strings is given by the number of bits that differ.

Example 4.13. Consider a multiclass problem where Y = {y1, y2, y3, y4}. Suppose we encode the classes using the following seven-bit codewords:

Class Codeword

1 1 1 1 1 1 1

0 0 0 0 1 1 1

0 0 1 1 0 0 1

0 1 0 1 0 1 0

Eachbitofthecodewordisusedtotrainabinaryclassifier.Ifatestinstanceisclassifiedas(0,1,1,1,1,1,1)bythebinaryclassifiers,thentheHammingdistancebetweenthecodewordand is1,whiletheHammingdistancetotheremainingclassesis3.Thetestinstanceisthereforeclassifiedas .

Aninterestingpropertyofanerror-correctingcodeisthatiftheminimumHammingdistancebetweenanypairofcodewordsisd,thenanyerrorsintheoutputcodecanbecorrectedusingitsnearestcodeword.InExample4.13 ,becausetheminimumHammingdistancebetweenanypairofcodewordsis4,theclassifiermaytolerateerrorsmadebyoneofthesevenbinaryclassifiers.Ifthereismorethanoneclassifierthatmakesamistake,thentheclassifiermaynotbeabletocompensatefortheerror.

Animportantissueishowtodesigntheappropriatesetofcodewordsfordifferentclasses.Fromcodingtheory,avastnumberofalgorithmshavebeendevelopedforgeneratingn-bitcodewordswithboundedHammingdistance.However,thediscussionofthesealgorithmsisbeyondthescopeofthisbook.Itisworthwhilementioningthatthereisasignificantdifferencebetweenthedesignoferror-correctingcodesforcommunicationtaskscomparedtothoseusedformulticlasslearning.Forcommunication,thecodewordsshouldmaximizetheHammingdistancebetweentherowssothaterrorcorrection

y1

y2

y3

y4

y1

y1

⌊(d−1)/2)⌋

canbeperformed.Multiclasslearning,however,requiresthatboththerow-wiseandcolumn-wisedistancesofthecodewordsmustbewellseparated.Alargercolumn-wisedistanceensuresthatthebinaryclassifiersaremutuallyindependent,whichisanimportantrequirementforensemblelearningmethods.
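The decoding step of ECOC reduces to a nearest-codeword search under Hamming distance. The following sketch is a minimal illustration (not the book's code) using the seven-bit codewords of Example 4.13; the per-bit classifiers are represented abstractly as a list of hypothetical functions.

```python
codewords = {            # codewords from Example 4.13
    "y1": (1, 1, 1, 1, 1, 1, 1),
    "y2": (0, 0, 0, 0, 1, 1, 1),
    "y3": (0, 0, 1, 1, 0, 0, 1),
    "y4": (0, 1, 0, 1, 0, 1, 0),
}

def hamming(a, b):
    """Number of bit positions in which two bit strings differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

def ecoc_predict(bit_classifiers, x):
    """Each bit classifier predicts one bit; choose the class with the nearest codeword."""
    output = tuple(clf(x) for clf in bit_classifiers)
    return min(codewords, key=lambda y: hamming(codewords[y], output))

# With the predicted output (0,1,1,1,1,1,1), the nearest codeword is y1 (distance 1).
print(min(codewords, key=lambda y: hamming(codewords[y], (0, 1, 1, 1, 1, 1, 1))))  # y1
```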

4.13BibliographicNotesMitchell[278]providesexcellentcoverageonmanyclassificationtechniquesfromamachinelearningperspective.ExtensivecoverageonclassificationcanalsobefoundinAggarwal[195],Dudaetal.[229],Webb[307],Fukunaga[237],Bishop[204],Hastieetal.[249],CherkasskyandMulier[215],WittenandFrank[310],Handetal.[247],HanandKamber[244],andDunham[230].

Directmethodsforrule-basedclassifierstypicallyemploythesequentialcoveringschemeforinducingclassificationrules.Holte's1R[255]isthesimplestformofarule-basedclassifierbecauseitsrulesetcontainsonlyasinglerule.Despiteitssimplicity,Holtefoundthatforsomedatasetsthatexhibitastrongone-to-onerelationshipbetweentheattributesandtheclasslabel,1Rperformsjustaswellasotherclassifiers.Otherexamplesofrule-basedclassifiersincludeIREP[234],RIPPER[218],CN2[216,217],AQ[276],RISE[224],andITRULE[296].Table4.8 showsacomparisonofthecharacteristicsoffouroftheseclassifiers.

Table 4.8. Comparison of various rule-based classifiers.

Characteristic | RIPPER | CN2 (unordered) | CN2 (ordered) | AQR
Rule-growing strategy | General-to-specific | General-to-specific | General-to-specific | General-to-specific (seeded by a positive example)
Evaluation metric | FOIL's Info gain | Laplace | Entropy and likelihood ratio | Number of true positives
Stopping condition for rule-growing | All examples belong to the same class | No performance gain | No performance gain | Rules cover only positive class
Rule pruning | Reduced error pruning | None | None | None
Instance elimination | Positive and negative | Positive only | Positive only | Positive and negative
Stopping condition for adding rules | Error > 50% or based on MDL | No performance gain | No performance gain | All positive examples are covered
Rule set pruning | Replace or modify rules | Statistical tests | None | None
Search strategy | Greedy | Beam search | Beam search | Beam search

Forrule-basedclassifiers,theruleantecedentcanbegeneralizedtoincludeanypropositionalorfirst-orderlogicalexpression(e.g.,Hornclauses).Readerswhoareinterestedinfirst-orderlogicrule-basedclassifiersmayrefertoreferencessuchas[278]orthevastliteratureoninductivelogicprogramming[279].Quinlan[287]proposedtheC4.5rulesalgorithmforextractingclassificationrulesfromdecisiontrees.AnindirectmethodforextractingrulesfromartificialneuralnetworkswasgivenbyAndrewsetal.in[198].

CoverandHart[220]presentedanoverviewofthenearestneighborclassificationmethodfromaBayesianperspective.Ahaprovidedboththeoreticalandempiricalevaluationsforinstance-basedmethodsin[196].PEBLS,whichwasdevelopedbyCostandSalzberg[219],isanearestneighborclassifierthatcanhandledatasetscontainingnominalattributes.


EachtrainingexampleinPEBLSisalsoassignedaweightfactorthatdependsonthenumberoftimestheexamplehelpsmakeacorrectprediction.Hanetal.[243]developedaweight-adjustednearestneighboralgorithm,inwhichthefeatureweightsarelearnedusingagreedy,hill-climbingoptimizationalgorithm.Amorerecentsurveyofk-nearestneighborclassificationisgivenbySteinbachandTan[298].

NaïveBayesclassifiershavebeeninvestigatedbymanyauthors,includingLangleyetal.[267],RamoniandSebastiani[288],Lewis[270],andDomingosandPazzani[227].AlthoughtheindependenceassumptionusedinnaïveBayesclassifiersmayseemratherunrealistic,themethodhasworkedsurprisinglywellforapplicationssuchastextclassification.Bayesiannetworksprovideamoreflexibleapproachbyallowingsomeoftheattributestobeinterdependent.AnexcellenttutorialonBayesiannetworksisgivenbyHeckermanin[252]andJensenin[258].Bayesiannetworksbelongtoabroaderclassofmodelsknownasprobabilisticgraphicalmodels.AformalintroductiontotherelationshipsbetweengraphsandprobabilitieswaspresentedinPearl[283].OthergreatresourcesonprobabilisticgraphicalmodelsincludebooksbyBishop[205],andJordan[259].Detaileddiscussionsofconceptssuchasd-separationandMarkovblanketsareprovidedinGeigeretal.[238]andRussellandNorvig[291].

Generalizedlinearmodels(GLM)arearichclassofregressionmodelsthathavebeenextensivelystudiedinthestatisticalliterature.TheywereformulatedbyNelderandWedderburnin1972[280]tounifyanumberofregressionmodelssuchaslinearregression,logisticregression,andPoissonregression,whichsharesomesimilaritiesintheirformulations.AnextensivediscussionofGLMsisprovidedinthebookbyMcCullaghandNelder[274].

Artificialneuralnetworks(ANN)havewitnessedalongandwindinghistoryofdevelopments,involvingmultiplephasesofstagnationandresurgence.The

ideaofamathematicalmodelofaneuralnetworkwasfirstintroducedin1943byMcCullochandPitts[275].Thisledtoaseriesofcomputationalmachinestosimulateaneuralnetworkbasedonthetheoryofneuralplasticity[289].Theperceptron,whichisthesimplestprototypeofmodernANNs,wasdevelopedbyRosenblattin1958[290].Theperceptronusesasinglelayerofprocessingunitsthatcanperformbasicmathematicaloperationssuchasadditionandmultiplication.However,theperceptroncanonlylearnlineardecisionboundariesandisguaranteedtoconvergeonlywhentheclassesarelinearlyseparable.Despitetheinterestinlearningmulti-layernetworkstoovercomethelimitationsofperceptron,progressinthisarearemainhalteduntiltheinventionofthebackpropagationalgorithmbyWerbosin1974[309],whichallowedforthequicktrainingofmulti-layerANNsusingthegradientdescentmethod.Thisledtoanupsurgeofinterestintheartificialintelligence(AI)communitytodevelopmulti-layerANNmodels,atrendthatcontinuedformorethanadecade.Historically,ANNsmarkaparadigmshiftinAIfromapproachesbasedonexpertsystems(whereknowledgeisencodedusingif-thenrules)tomachinelearningapproaches(wheretheknowledgeisencodedintheparametersofacomputationalmodel).However,therewerestillanumberofalgorithmicandcomputationalchallengesinlearninglargeANNmodels,whichremainedunresolvedforalongtime.ThishinderedthedevelopmentofANNmodelstothescalenecessaryforsolvingreal-worldproblems.Slowly,ANNsstartedgettingoutpacedbyotherclassificationmodelssuchassupportvectormachines,whichprovidedbetterperformanceaswellastheoreticalguaranteesofconvergenceandoptimality.Itisonlyrecentlythatthechallengesinlearningdeepneuralnetworkshavebeencircumvented,owingtobettercomputationalresourcesandanumberofalgorithmicimprovementsinANNssince2006.Thisre-emergenceofANNhasbeendubbedas“deeplearning,”whichhasoftenoutperformedexistingclassificationmodelsandgainedwide-spreadpopularity.

Deeplearningisarapidlyevolvingareaofresearchwithanumberofpotentiallyimpactfulcontributionsbeingmadeeveryyear.Someofthelandmarkadvancementsindeeplearningincludetheuseoflarge-scalerestrictedBoltzmannmachinesforlearninggenerativemodelsofdata[201,253],theuseofautoencodersanditsvariants(denoisingautoencoders)forlearningrobustfeaturerepresentations[199,305,306],andsophisticalarchitecturestopromotesharingofparametersacrossnodessuchasconvolutionalneuralnetworksforimages[265,268]andrecurrentneuralnetworksforsequences[241,242,277].OthermajorimprovementsincludetheapproachofunsupervisedpretrainingforinitializingANNmodels[232],thedropouttechniqueforregularization[254,297],batchnormalizationforfastlearningofANNparameters[256],andmaxoutnetworksforeffectiveusageofthedropouttechnique[240].EventhoughthediscussionsinthischapteronlearningANNmodelswerecenteredaroundthegradientdescentmethod,mostofthemodernANNmodelsinvolvingalargenumberofhiddenlayersaretrainedusingthestochasticgradientdescentmethodsinceitishighlyscalable[207].AnextensivesurveyofdeeplearningapproacheshasbeenpresentedinreviewarticlesbyBengio[200],LeCunetal.[269],andSchmidhuber[293].AnexcellentsummaryofdeeplearningapproachescanalsobeobtainedfromrecentbooksbyGoodfellowetal.[239]andNielsen[281].

Vapnik [303, 304] has written two authoritative books on Support Vector Machines (SVM). Other useful resources on SVM and kernel methods include the books by Cristianini and Shawe-Taylor [221] and Schölkopf and Smola [294]. There are several survey articles on SVM, including those written by Burges [212], Bennett et al. [202], Hearst [251], and Mangasarian [272]. SVM can also be viewed as an L2 norm regularizer of the hinge loss function, as described in detail by Hastie et al. [249]. The L1 norm regularizer of the square loss function can be obtained using the least absolute shrinkage and selection operator (Lasso), which was introduced by Tibshirani in 1996 [301].

TheLassohasseveralinterestingpropertiessuchastheabilitytosimultaneouslyperformfeatureselectionaswellasregularization,sothatonlyasubsetoffeaturesareselectedinthefinalmodel.AnexcellentreviewofLassocanbeobtainedfromabookbyHastieetal.[250].

AsurveyofensemblemethodsinmachinelearningwasgivenbyDiet-terich[222].ThebaggingmethodwasproposedbyBreiman[209].FreundandSchapire[236]developedtheAdaBoostalgorithm.Arcing,whichstandsforadaptiveresamplingandcombining,isavariantoftheboostingalgorithmproposedbyBreiman[210].Itusesthenon-uniformweightsassignedtotrainingexamplestoresamplethedataforbuildinganensembleoftrainingsets.UnlikeAdaBoost,thevotesofthebaseclassifiersarenotweightedwhendeterminingtheclasslabeloftestexamples.TherandomforestmethodwasintroducedbyBreimanin[211].Theconceptofbias-variancedecompositionisexplainedindetailbyHastieetal.[249].Whilethebias-variancedecompositionwasinitiallyproposedforregressionproblemswithsquaredlossfunction,aunifiedframeworkforclassificationproblemsinvolving0–1losseswasintroducedbyDomingos[226].

RelatedworkonminingrareandimbalanceddatasetscanbefoundinthesurveypaperswrittenbyChawlaetal.[214]andWeiss[308].Sampling-basedmethodsforminingimbalanceddatasetshavebeeninvestigatedbymanyauthors,suchasKubatandMatwin[266],Japkowitz[257],andDrummondandHolte[228].Joshietal.[261]discussedthelimitationsofboostingalgorithmsforrareclassmodeling.OtheralgorithmsdevelopedforminingrareclassesincludeSMOTE[213],PNrule[260],andCREDOS[262].

Various alternative metrics that are well-suited for class imbalanced problems are available. The precision, recall, and F1-measure are widely-used metrics in information retrieval [302]. ROC analysis was originally used in signal detection theory for performing aggregate evaluation over a range of score

thresholds.AmethodforcomparingclassifierperformanceusingtheconvexhullofROCcurveswassuggestedbyProvostandFawcettin[286].Bradley[208]investigatedtheuseofareaundertheROCcurve(AUC)asaperformancemetricformachinelearningalgorithms.DespitethevastbodyofliteratureonoptimizingtheAUCmeasureinmachinelearningmodels,itiswell-knownthatAUCsuffersfromcertainlimitations.Forexample,theAUCcanbeusedtocomparethequalityoftwoclassifiersonlyiftheROCcurveofoneclassifierstrictlydominatestheother.However,iftheROCcurvesoftwoclassifiersintersectatanypoint,thenitisdifficulttoassesstherelativequalityofclassifiersusingtheAUCmeasure.Anin-depthdiscussionofthepitfallsinusingAUCasaperformancemeasurecanbeobtainedinworksbyHand[245,246],andPowers[284].TheAUChasalsobeenconsideredtobeanincoherentmeasureofperformance,i.e.,itusesdifferentscaleswhilecomparingtheperformanceofdifferentclassifiers,althoughacoherentinterpretationofAUChasbeenprovidedbyFerrietal.[235].BerrarandFlach[203]describesomeofthecommoncaveatsinusingtheROCcurveforclinicalmicroarrayresearch.Analternateapproachformeasuringtheaggregateperformanceofaclassifieristheprecision-recall(PR)curve,whichisespeciallyusefulinthepresenceofclassimbalance[292].

Anexcellenttutorialoncost-sensitivelearningcanbefoundinareviewarticlebyLingandSheng[271].ThepropertiesofacostmatrixhadbeenstudiedbyElkanin[231].MargineantuandDietterich[273]examinedvariousmethodsforincorporatingcostinformationintotheC4.5learningalgorithm,includingwrappermethods,classdistribution-basedmethods,andloss-basedmethods.Othercost-sensitivelearningmethodsthatarealgorithm-independentincludeAdaCost[233],MetaCost[225],andcosting[312].

Extensiveliteratureisalsoavailableonthesubjectofmulticlasslearning.ThisincludestheworksofHastieandTibshirani[248],Allweinetal.[197],KongandDietterich[264],andTaxandDuin[300].Theerror-correctingoutput

coding(ECOC)methodwasproposedbyDietterichandBakiri[223].Theyhadalsoinvestigatedtechniquesfordesigningcodesthataresuitableforsolvingmulticlassproblems.

Apartfromexploringalgorithmsfortraditionalclassificationsettingswhereeveryinstancehasasinglesetoffeatureswithauniquecategoricallabel,therehasbeenalotofrecentinterestinnon-traditionalclassificationparadigms,involvingcomplexformsofinputsandoutputs.Forexample,theparadigmofmulti-labellearningallowsforaninstancetobeassignedmultipleclasslabelsratherthanjustone.Thisisusefulinapplicationssuchasobjectrecognitioninimages,whereaphotoimagemayincludemorethanoneclassificationobject,suchas,grass,sky,trees,andmountains.Asurveyonmulti-labellearningcanbefoundin[313].Asanotherexample,theparadigmofmulti-instancelearningconsiderstheproblemwheretheinstancesareavailableintheformofgroupscalledbags,andtraininglabelsareavailableatthelevelofbagsratherthanindividualinstances.Multi-instancelearningisusefulinapplicationswhereanobjectcanexistasmultipleinstancesindifferentstates(e.g.,thedifferentisomersofachemicalcompound),andevenifasingleinstanceshowsaspecificcharacteristic,theentirebagofinstancesassociatedwiththeobjectneedstobeassignedtherelevantclass.Asurveyonmulti-instancelearningisprovidedin[314].

Inanumberofreal-worldapplications,itisoftenthecasethatthetraininglabelsarescarceinquantity,becauseofthehighcostsassociatedwithobtaininggold-standardsupervision.However,wealmostalwayshaveabundantaccesstounlabeledtestinstances,whichdonothavesupervisedlabelsbutcontainvaluableinformationaboutthestructureordistributionofinstances.Traditionallearningalgorithms,whichonlymakeuseofthelabeledinstancesinthetrainingsetforlearningthedecisionboundary,areunabletoexploittheinformationcontainedinunlabeledinstances.Incontrast,learningalgorithmsthatmakeuseofthestructureintheunlabeleddataforlearningthe

classificationmodelareknownassemi-supervisedlearningalgorithms[315,316].Theuseofunlabeleddataisalsoexploredintheparadigmofmulti-viewlearning[299,311],whereeveryobjectisobservedinmultipleviewsofthedata,involvingdiversesetsoffeatures.Acommonstrategyusedbymulti-viewlearningalgorithmsisco-training[206],whereadifferentmodelislearnedforeveryviewofthedata,butthemodelpredictionsfromeveryviewareconstrainedtobeidenticaltoeachotherontheunlabeledtestinstances.

Anotherlearningparadigmthatiscommonlyexploredinthepaucityoftrainingdataistheframeworkofactivelearning,whichattemptstoseekthesmallestsetoflabelannotationstolearnareasonableclassificationmodel.Activelearningexpectstheannotatortobeinvolvedintheprocessofmodellearning,sothatthelabelsarerequestedincrementallyoverthemostrelevantsetofinstances,givenalimitedbudgetoflabelannotations.Forexample,itmaybeusefultoobtainlabelsoverinstancesclosertothedecisionboundarythatcanplayabiggerroleinfine-tuningtheboundary.Areviewonactivelearningapproachescanbefoundin[285,295].

Insomeapplications,itisimportanttosimultaneouslysolvemultiplelearningtaskstogether,wheresomeofthetasksmaybesimilartooneanother.Forexample,ifweareinterestedintranslatingapassagewritteninEnglishintodifferentlanguages,thetasksinvolvinglexicallysimilarlanguages(suchasSpanishandPortuguese)wouldrequiresimilarlearningofmodels.Theparadigmofmulti-tasklearninghelpsinsimultaneouslylearningacrossalltaskswhilesharingthelearningamongrelatedtasks.Thisisespeciallyusefulwhensomeofthetasksdonotcontainsufficientlymanytrainingsamples,inwhichcaseborrowingthelearningfromotherrelatedtaskshelpsinthelearningofrobustmodels.Aspecialcaseofmulti-tasklearningistransferlearning,wherethelearningfromasourcetask(withsufficientnumberoftrainingsamples)hastobetransferredtoadestinationtask(withpaucityof

trainingdata).AnextensivesurveyoftransferlearningapproachesisprovidedbyPanetal.[282].

Mostclassifiersassumeeverydatainstancemustbelongtoaclass,whichisnotalwaystrueforsomeapplications.Forexample,inmalwaredetection,duetotheeaseinwhichnewmalwaresarecreated,aclassifiertrainedonexistingclassesmayfailtodetectnewonesevenifthefeaturesforthenewmalwaresareconsiderablydifferentthanthoseforexistingmalwares.Anotherexampleisincriticalapplicationssuchasmedicaldiagnosis,wherepredictionerrorsarecostlyandcanhavesevereconsequences.Inthissituation,itwouldbebetterfortheclassifiertorefrainfrommakinganypredictiononadatainstanceifitisunsureofitsclass.Thisapproach,knownasclassifierwithrejectoption,doesnotneedtoclassifyeverydatainstanceunlessitdeterminesthepredictionisreliable(e.g.,iftheclassprobabilityexceedsauser-specifiedthreshold).Instancesthatareunclassifiedcanbepresentedtodomainexpertsforfurtherdeterminationoftheirtrueclasslabels.

Classifierscanalsobedistinguishedintermsofhowtheclassificationmodelistrained.Abatchclassifierassumesallthelabeledinstancesareavailableduringtraining.Thisstrategyisapplicablewhenthetrainingsetsizeisnottoolargeandforstationarydata,wheretherelationshipbetweentheattributesandclassesdoesnotvaryovertime.Anonlineclassifier,ontheotherhand,trainsaninitialmodelusingasubsetofthelabeleddata[263].Themodelisthenupdatedincrementallyasmorelabeledinstancesbecomeavailable.Thisstrategyiseffectivewhenthetrainingsetistoolargeorwhenthereisconceptdriftduetochangesinthedistributionofthedataovertime.

Bibliography[195]C.C.Aggarwal.Dataclassification:algorithmsandapplications.CRC

Press,2014.

[196]D.W.Aha.Astudyofinstance-basedalgorithmsforsupervisedlearningtasks:mathematical,empirical,andpsychologicalevaluations.PhDthesis,UniversityofCalifornia,Irvine,1990.

[197]E.L.Allwein,R.E.Schapire,andY.Singer.ReducingMulticlasstoBinary:AUnifyingApproachtoMarginClassifiers.JournalofMachineLearningResearch,1:113–141,2000.

[198]R.Andrews,J.Diederich,andA.Tickle.ASurveyandCritiqueofTechniquesForExtractingRulesFromTrainedArtificialNeuralNetworks.KnowledgeBasedSystems,8(6):373–389,1995.

[199]P.Baldi.Autoencoders,unsupervisedlearning,anddeeparchitectures.ICMLunsupervisedandtransferlearning,27(37-50):1,2012.

[200]Y.Bengio.LearningdeeparchitecturesforAI.FoundationsandtrendsRinMachineLearning,2(1):1–127,2009.

[201]Y.Bengio,A.Courville,andP.Vincent.Representationlearning:Areviewandnewperspectives.IEEEtransactionsonpatternanalysisand

machineintelligence,35(8):1798–1828,2013.

[202]K.BennettandC.Campbell.SupportVectorMachines:HypeorHallelujah.SIGKDDExplorations,2(2):1–13,2000.

[203]D.BerrarandP.Flach.CaveatsandpitfallsofROCanalysisinclinicalmicroarrayresearch(andhowtoavoidthem).Briefingsinbioinformatics,pagebbr008,2011.

[204]C.M.Bishop.NeuralNetworksforPatternRecognition.OxfordUniversityPress,Oxford,U.K.,1995.

[205]C.M.Bishop.PatternRecognitionandMachineLearning.Springer,2006.

[206]A.BlumandT.Mitchell.Combininglabeledandunlabeleddatawithco-training.InProceedingsoftheeleventhannualconferenceonComputationallearningtheory,pages92–100.ACM,1998.

[207]L.Bottou.Large-scalemachinelearningwithstochasticgradientdescent.InProceedingsofCOMPSTAT'2010,pages177–186.Springer,2010.

[208]A.P.Bradley.TheuseoftheareaundertheROCcurveintheEvaluationofMachineLearningAlgorithms.PatternRecognition,30(7):1145–1149,1997.

[209]L.Breiman.BaggingPredictors.MachineLearning,24(2):123–140,1996.

[210]L.Breiman.Bias,Variance,andArcingClassifiers.TechnicalReport486,UniversityofCalifornia,Berkeley,CA,1996.

[211]L.Breiman.RandomForests.MachineLearning,45(1):5–32,2001.

[212]C.J.C.Burges.ATutorialonSupportVectorMachinesforPatternRecognition.DataMiningandKnowledgeDiscovery,2(2):121–167,1998.

[213]N.V.Chawla,K.W.Bowyer,L.O.Hall,andW.P.Kegelmeyer.SMOTE:SyntheticMinorityOver-samplingTechnique.JournalofArtificialIntelligenceResearch,16:321–357,2002.

[214]N.V.Chawla,N.Japkowicz,andA.Kolcz.Editorial:SpecialIssueonLearningfromImbalancedDataSets.SIGKDDExplorations,6(1):1–6,2004.

[215]V.CherkasskyandF.Mulier.LearningfromData:Concepts,Theory,andMethods.WileyInterscience,1998.

[216]P.ClarkandR.Boswell.RuleInductionwithCN2:SomeRecentImprovements.InMachineLearning:Proc.ofthe5thEuropeanConf.(EWSL-91),pages151–163,1991.

[217]P.ClarkandT.Niblett.TheCN2InductionAlgorithm.MachineLearning,3(4):261–283,1989.

[218]W.W.Cohen.FastEffectiveRuleInduction.InProc.ofthe12thIntl.Conf.onMachineLearning,pages115–123,TahoeCity,CA,July1995.

[219]S.CostandS.Salzberg.AWeightedNearestNeighborAlgorithmforLearningwithSymbolicFeatures.MachineLearning,10:57–78,1993.

[220] T. M. Cover and P. E. Hart. Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

[221]N.CristianiniandJ.Shawe-Taylor.AnIntroductiontoSupportVectorMachinesandOtherKernel-basedLearningMethods.CambridgeUniversityPress,2000.

[222]T.G.Dietterich.EnsembleMethodsinMachineLearning.InFirstIntl.WorkshoponMultipleClassifierSystems,Cagliari,Italy,2000.

[223]T.G.DietterichandG.Bakiri.SolvingMulticlassLearningProblemsviaError-CorrectingOutputCodes.JournalofArtificialIntelligenceResearch,2:263–286,1995.

[224]P.Domingos.TheRISEsystem:Conqueringwithoutseparating.InProc.ofthe6thIEEEIntl.Conf.onToolswithArtificialIntelligence,pages704–707,NewOrleans,LA,1994.

[225]P.Domingos.MetaCost:AGeneralMethodforMakingClassifiersCost-Sensitive.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages155–164,SanDiego,CA,August1999.

[226]P.Domingos.Aunifiedbias-variancedecomposition.InProceedingsof17thInternationalConferenceonMachineLearning,pages231–238,2000.

[227]P.DomingosandM.Pazzani.OntheOptimalityoftheSimpleBayesianClassifierunderZero-OneLoss.MachineLearning,29(2-3):103–130,1997.

[228]C.DrummondandR.C.Holte.C4.5,Classimbalance,andCostsensitivity:Whyunder-samplingbeatsover-sampling.InICML'2004WorkshoponLearningfromImbalancedDataSetsII,Washington,DC,August2003.

[229]R.O.Duda,P.E.Hart,andD.G.Stork.PatternClassification.JohnWiley&Sons,Inc.,NewYork,2ndedition,2001.

[230]M.H.Dunham.DataMining:IntroductoryandAdvancedTopics.PrenticeHall,2006.

[231]C.Elkan.TheFoundationsofCost-SensitiveLearning.InProc.ofthe17thIntl.JointConf.onArtificialIntelligence,pages973–978,Seattle,WA,August2001.

[232]D.Erhan,Y.Bengio,A.Courville,P.-A.Manzagol,P.Vincent,andS.Bengio.Whydoesunsupervisedpre-traininghelpdeeplearning?JournalofMachineLearningResearch,11(Feb):625–660,2010.

[233]W.Fan,S.J.Stolfo,J.Zhang,andP.K.Chan.AdaCost:misclassificationcost-sensitiveboosting.InProc.ofthe16thIntl.Conf.onMachineLearning,pages97–105,Bled,Slovenia,June1999.

[234]J.FürnkranzandG.Widmer.Incrementalreducederrorpruning.InProc.ofthe11thIntl.Conf.onMachineLearning,pages70–77,NewBrunswick,NJ,July1994.

[235]C.Ferri,J.Hernández-Orallo,andP.A.Flach.AcoherentinterpretationofAUCasameasureofaggregatedclassificationperformance.InProceedingsofthe28thInternationalConferenceonMachineLearning(ICML-11),pages657–664,2011.

[236]Y.FreundandR.E.Schapire.Adecision-theoreticgeneralizationofon-linelearningandanapplicationtoboosting.JournalofComputerandSystemSciences,55(1):119–139,1997.

[237]K.Fukunaga.IntroductiontoStatisticalPatternRecognition.AcademicPress,NewYork,1990.

[238]D.Geiger,T.S.Verma,andJ.Pearl.d-separation:Fromtheoremstoalgorithms.arXivpreprintarXiv:1304.1505,2013.

[239]I.Goodfellow,Y.Bengio,andA.Courville.DeepLearning.BookinpreparationforMITPress,2016.

[240]I.J.Goodfellow,D.Warde-Farley,M.Mirza,A.C.Courville,andY.Bengio.Maxoutnetworks.ICML(3),28:1319–1327,2013.

[241]A.Graves,M.Liwicki,S.Fernández,R.Bertolami,H.Bunke,andJ.Schmidhuber.Anovelconnectionistsystemforunconstrainedhandwritingrecognition.IEEEtransactionsonpatternanalysisandmachineintelligence,31(5):855–868,2009.

[242]A.GravesandJ.Schmidhuber.Offlinehandwritingrecognitionwithmultidimensionalrecurrentneuralnetworks.InAdvancesinneuralinformationprocessingsystems,pages545–552,2009.

[243]E.-H.Han,G.Karypis,andV.Kumar.TextCategorizationUsingWeightAdjustedk-NearestNeighborClassification.InProc.ofthe5thPacific-AsiaConf.onKnowledgeDiscoveryandDataMining,Lyon,France,2001.

[244]J.HanandM.Kamber.DataMining:ConceptsandTechniques.MorganKaufmannPublishers,SanFrancisco,2001.

[245]D.J.Hand.Measuringclassifierperformance:acoherentalternativetotheareaundertheROCcurve.Machinelearning,77(1):103–123,2009.

[246]D.J.Hand.Evaluatingdiagnostictests:theareaundertheROCcurveandthebalanceoferrors.Statisticsinmedicine,29(14):1502–1510,2010.

[247]D.J.Hand,H.Mannila,andP.Smyth.PrinciplesofDataMining.MITPress,2001.

[248]T.HastieandR.Tibshirani.Classificationbypairwisecoupling.AnnalsofStatistics,26(2):451–471,1998.

[249]T.Hastie,R.Tibshirani,andJ.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,andPrediction.Springer,2ndedition,2009.

[250]T.Hastie,R.Tibshirani,andM.Wainwright.Statisticallearningwithsparsity:thelassoandgeneralizations.CRCPress,2015.

[251]M.Hearst.Trends&Controversies:SupportVectorMachines.IEEEIntelligentSystems,13(4):18–28,1998.

[252]D.Heckerman.BayesianNetworksforDataMining.DataMiningandKnowledgeDiscovery,1(1):79–119,1997.

[253]G.E.HintonandR.R.Salakhutdinov.Reducingthedimensionalityofdatawithneuralnetworks.Science,313(5786):504–507,2006.

[254]G.E.Hinton,N.Srivastava,A.Krizhevsky,I.Sutskever,andR.R.Salakhutdinov.Improvingneuralnetworksbypreventingco-adaptationoffeaturedetectors.arXivpreprintarXiv:1207.0580,2012.

[255]R.C.Holte.VerySimpleClassificationRulesPerformWellonMostCommonlyUsedDatasets.MachineLearning,11:63–91,1993.

[256]S.IoffeandC.Szegedy.Batchnormalization:Acceleratingdeepnetworktrainingbyreducinginternalcovariateshift.arXivpreprintarXiv:1502.03167,2015.

[257]N.Japkowicz.TheClassImbalanceProblem:SignificanceandStrategies.InProc.ofthe2000Intl.Conf.onArtificialIntelligence:SpecialTrackonInductiveLearning,volume1,pages111–117,LasVegas,NV,June2000.

[258]F.V.Jensen.AnintroductiontoBayesiannetworks,volume210.UCLpressLondon,1996.

[259]M.I.Jordan.Learningingraphicalmodels,volume89.SpringerScience&BusinessMedia,1998.

[260]M.V.Joshi,R.C.Agarwal,andV.Kumar.MiningNeedlesinaHaystack:ClassifyingRareClassesviaTwo-PhaseRuleInduction.InProc.of2001ACM-SIGMODIntl.Conf.onManagementofData,pages91–102,SantaBarbara,CA,June2001.

[261]M.V.Joshi,R.C.Agarwal,andV.Kumar.Predictingrareclasses:canboostingmakeanyweaklearnerstrong?InProc.ofthe8thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages297–306,Edmonton,Canada,July2002.

[262]M.V.JoshiandV.Kumar.CREDOS:ClassificationUsingRippleDownStructure(ACaseforRareClasses).InProc.oftheSIAMIntl.Conf.onDataMining,pages321–332,Orlando,FL,April2004.

[263]J.Kivinen,A.J.Smola,andR.C.Williamson.Onlinelearningwithkernels.IEEEtransactionsonsignalprocessing,52(8):2165–2176,2004.

[264]E.B.KongandT.G.Dietterich.Error-CorrectingOutputCodingCorrectsBiasandVariance.InProc.ofthe12thIntl.Conf.onMachineLearning,pages313–321,TahoeCity,CA,July1995.

[265]A.Krizhevsky,I.Sutskever,andG.E.Hinton.Imagenetclassificationwithdeepconvolutionalneuralnetworks.InAdvancesinneuralinformationprocessingsystems,pages1097–1105,2012.

[266]M.KubatandS.Matwin.AddressingtheCurseofImbalancedTrainingSets:OneSidedSelection.InProc.ofthe14thIntl.Conf.onMachineLearning,pages179–186,Nashville,TN,July1997.

[267]P.Langley,W.Iba,andK.Thompson.AnanalysisofBayesianclassifiers.InProc.ofthe10thNationalConf.onArtificialIntelligence,pages223–228,1992.

[268]Y.LeCunandY.Bengio.Convolutionalnetworksforimages,speech,andtimeseries.Thehandbookofbraintheoryandneuralnetworks,3361(10):1995,1995.

[269]Y.LeCun,Y.Bengio,andG.Hinton.Deeplearning.Nature,521(7553):436–444,2015.

[270]D.D.Lewis.NaiveBayesatForty:TheIndependenceAssumptioninInformationRetrieval.InProc.ofthe10thEuropeanConf.onMachineLearning(ECML1998),pages4–15,1998.

[271]C.X.LingandV.S.Sheng.Cost-sensitivelearning.InEncyclopediaofMachineLearning,pages231–235.Springer,2011.

[272]O.Mangasarian.DataMiningviaSupportVectorMachines.TechnicalReportTechnicalReport01-05,DataMiningInstitute,May2001.

[273]D.D.MargineantuandT.G.Dietterich.LearningDecisionTreesforLossMinimizationinMulti-ClassProblems.TechnicalReport99-30-03,OregonStateUniversity,1999.

[274]P.McCullaghandJ.A.Nelder.Generalizedlinearmodels,volume37.CRCpress,1989.

[275]W.S.McCullochandW.Pitts.Alogicalcalculusoftheideasimmanentinnervousactivity.Thebulletinofmathematicalbiophysics,5(4):115–133,1943.

[276]R.S.Michalski,I.Mozetic,J.Hong,andN.Lavrac.TheMulti-PurposeIncrementalLearningSystemAQ15andItsTestingApplicationtoThree

MedicalDomains.InProc.of5thNationalConf.onArtificialIntelligence,Orlando,August1986.

[277]T.Mikolov,M.Karafiát,L.Burget,J.Cernock`y,andS.Khudanpur.Recurrentneuralnetworkbasedlanguagemodel.InInterspeech,volume2,page3,2010.

[278]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.

[279]S.Muggleton.FoundationsofInductiveLogicProgramming.PrenticeHall,EnglewoodCliffs,NJ,1995.

[280]J.A.NelderandR.J.Baker.Generalizedlinearmodels.Encyclopediaofstatisticalsciences,1972.

[281]M.A.Nielsen.Neuralnetworksanddeeplearning.Publishedonline:http://neuralnetworksanddeeplearning.com/.(visited:10.15.2016),2015.

[282]S.J.PanandQ.Yang.Asurveyontransferlearning.IEEETransactionsonknowledgeanddataengineering,22(10):1345–1359,2010.

[283]J.Pearl.Probabilisticreasoninginintelligentsystems:networksofplausibleinference.MorganKaufmann,2014.

[284]D.M.Powers.Theproblemofareaunderthecurve.In2012IEEEInternationalConferenceonInformationScienceandTechnology,pages

567–573.IEEE,2012.

[285]M.Prince.Doesactivelearningwork?Areviewoftheresearch.Journalofengineeringeducation,93(3):223–231,2004.

[286]F.J.ProvostandT.Fawcett.AnalysisandVisualizationofClassifierPerformance:ComparisonunderImpreciseClassandCostDistributions.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryandDataMining,pages43–48,NewportBeach,CA,August1997.

[287]J.R.Quinlan.C4.5:ProgramsforMachineLearning.Morgan-KaufmannPublishers,SanMateo,CA,1993.

[288]M.RamoniandP.Sebastiani.RobustBayesclassifiers.ArtificialIntelligence,125:209–226,2001.

[289]N.Rochester,J.Holland,L.Haibt,andW.Duda.Testsonacellassemblytheoryoftheactionofthebrain,usingalargedigitalcomputer.IRETransactionsoninformationTheory,2(3):80–93,1956.

[290]F.Rosenblatt.Theperceptron:aprobabilisticmodelforinformationstorageandorganizationinthebrain.Psychologicalreview,65(6):386,1958.

[291]S.J.Russell,P.Norvig,J.F.Canny,J.M.Malik,andD.D.Edwards.Artificialintelligence:amodernapproach,volume2.PrenticehallUpperSaddleRiver,2003.

[292]T.SaitoandM.Rehmsmeier.Theprecision-recallplotismoreinformativethantheROCplotwhenevaluatingbinaryclassifiersonimbalanceddatasets.PloSone,10(3):e0118432,2015.

[293]J.Schmidhuber.Deeplearninginneuralnetworks:Anoverview.NeuralNetworks,61:85–117,2015.

[294]B.SchölkopfandA.J.Smola.LearningwithKernels:SupportVectorMachines,Regularization,Optimization,andBeyond.MITPress,2001.

[295]B.Settles.Activelearningliteraturesurvey.UniversityofWisconsin,Madison,52(55-66):11,2010.

[296]P.SmythandR.M.Goodman.AnInformationTheoreticApproachtoRuleInductionfromDatabases.IEEETrans.onKnowledgeandDataEngineering,4(4):301–316,1992.

[297]N.Srivastava,G.E.Hinton,A.Krizhevsky,I.Sutskever,andR.Salakhutdinov.Dropout:asimplewaytopreventneuralnetworksfromoverfitting.JournalofMachineLearningResearch,15(1):1929–1958,2014.

[298]M.SteinbachandP.-N.Tan.kNN:k-NearestNeighbors.InX.WuandV.Kumar,editors,TheTopTenAlgorithmsinDataMining.ChapmanandHall/CRCReference,1stedition,2009.

[299]S.Sun.Asurveyofmulti-viewmachinelearning.NeuralComputingandApplications,23(7-8):2031–2038,2013.

[300]D.M.J.TaxandR.P.W.Duin.UsingTwo-ClassClassifiersforMulticlassClassification.InProc.ofthe16thIntl.Conf.onPatternRecognition(ICPR2002),pages124–127,Quebec,Canada,August2002.

[301]R.Tibshirani.Regressionshrinkageandselectionviathelasso.JournaloftheRoyalStatisticalSociety.SeriesB(Methodological),pages267–288,1996.

[302]C.J.vanRijsbergen.InformationRetrieval.Butterworth-Heinemann,Newton,MA,1978.

[303]V.Vapnik.TheNatureofStatisticalLearningTheory.SpringerVerlag,NewYork,1995.

[304]V.Vapnik.StatisticalLearningTheory.JohnWiley&Sons,NewYork,1998.

[305]P.Vincent,H.Larochelle,Y.Bengio,andP.-A.Manzagol.Extractingandcomposingrobustfeatureswithdenoisingautoencoders.InProceedingsofthe25thinternationalconferenceonMachinelearning,pages1096–1103.ACM,2008.

[306]P.Vincent,H.Larochelle,I.Lajoie,Y.Bengio,andP.-A.Manzagol.Stackeddenoisingautoencoders:Learningusefulrepresentationsina

deepnetworkwithalocaldenoisingcriterion.JournalofMachineLearningResearch,11(Dec):3371–3408,2010.

[307]A.R.Webb.StatisticalPatternRecognition.JohnWiley&Sons,2ndedition,2002.

[308]G.M.Weiss.MiningwithRarity:AUnifyingFramework.SIGKDDExplorations,6(1):7–19,2004.

[309] P. Werbos. Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University, 1974.

[310]I.H.WittenandE.Frank.DataMining:PracticalMachineLearningToolsandTechniqueswithJavaImplementations.MorganKaufmann,1999.

[311]C.Xu,D.Tao,andC.Xu.Asurveyonmulti-viewlearning.arXivpreprintarXiv:1304.5634,2013.

[312]B.Zadrozny,J.C.Langford,andN.Abe.Cost-SensitiveLearningbyCost-ProportionateExampleWeighting.InProc.ofthe2003IEEEIntl.Conf.onDataMining,pages435–442,Melbourne,FL,August2003.

[313]M.-L.ZhangandZ.-H.Zhou.Areviewonmulti-labellearningalgorithms.IEEEtransactionsonknowledgeanddataengineering,26(8):1819–1837,2014.

[314]Z.-H.Zhou.Multi-instancelearning:Asurvey.DepartmentofComputerScience&Technology,NanjingUniversity,Tech.Rep,2004.

[315]X.Zhu.Semi-supervisedlearning.InEncyclopediaofmachinelearning,pages892–897.Springer,2011.

[316]X.ZhuandA.B.Goldberg.Introductiontosemi-supervisedlearning.Synthesislecturesonartificialintelligenceandmachinelearning,3(1):1–130,2009.

4.14 Exercises

1. Consider a binary classification problem with the following set of attributes and attribute values:

Air Conditioner = {Working, Broken}
Engine = {Good, Bad}
Mileage = {High, Medium, Low}
Rust = {Yes, No}

Suppose a rule-based classifier produces the following rule set:

Mileage = High → Value = Low
Mileage = Low → Value = High
Air Conditioner = Working, Engine = Good → Value = High
Air Conditioner = Working, Engine = Bad → Value = Low
Air Conditioner = Broken → Value = Low

a. Are the rules mutually exclusive?

b. Is the rule set exhaustive?

c. Is ordering needed for this set of rules?

d. Do you need a default class for the rule set?

2. The RIPPER algorithm (by Cohen [218]) is an extension of an earlier algorithm called IREP (by Fürnkranz and Widmer [234]). Both algorithms apply the reduced-error pruning method to determine whether a rule needs to be pruned. The reduced error pruning method uses a validation set to estimate the generalization error of a classifier. Consider the following pair of rules:

R1: A → C
R2: A ∧ B → C

R2 is obtained by adding a new conjunct, B, to the left-hand side of R1. For this question, you will be asked to determine whether R2 is preferred over R1 from the perspectives of rule-growing and rule-pruning. To determine whether a rule should be pruned, IREP computes the following measure:

vIREP = (p + (N − n)) / (P + N),

where P is the total number of positive examples in the validation set, N is the total number of negative examples in the validation set, p is the number of positive examples in the validation set covered by the rule, and n is the number of negative examples in the validation set covered by the rule. vIREP is actually similar to classification accuracy for the validation set. IREP favors rules that have higher values of vIREP. On the other hand, RIPPER applies the following measure to determine whether a rule should be pruned:

vRIPPER = (p − n) / (p + n).

a. Suppose R1 is covered by 350 positive examples and 150 negative examples, while R2 is covered by 300 positive examples and 50 negative examples. Compute the FOIL's information gain for the rule R2 with respect to R1.

b. Consider a validation set that contains 500 positive examples and 500 negative examples. For R1, suppose the number of positive examples covered by the rule is 200, and the number of negative examples covered by the rule is 50. For R2, suppose the number of positive examples covered by the rule is 100 and the number of negative examples is 5. Compute vIREP for both rules. Which rule does IREP prefer?

c. Compute vRIPPER for the previous problem. Which rule does RIPPER prefer?
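The two pruning measures defined above translate directly into small helper functions. The sketch below is an illustration of the formulas only; the counts in the usage example are made up and are not taken from the exercise.

```python
def v_irep(p, n, P, N):
    """IREP pruning measure: (p + (N - n)) / (P + N)."""
    return (p + (N - n)) / (P + N)

def v_ripper(p, n):
    """RIPPER pruning measure: (p - n) / (p + n)."""
    return (p - n) / (p + n)

# Example: a rule covering 40 of 50 validation positives and 10 of 50 negatives.
print(v_irep(p=40, n=10, P=50, N=50))   # 0.8
print(v_ripper(p=40, n=10))             # 0.6
```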

3.C4.5rulesisanimplementationofanindirectmethodforgeneratingrulesfromadecisiontree.RIPPERisanimplementationofadirectmethodforgeneratingrulesdirectlyfromdata.

a. Discussthestrengthsandweaknessesofbothmethods.

b. Consideradatasetthathasalargedifferenceintheclasssize(i.e.,someclassesaremuchbiggerthanothers).Whichmethod(betweenC4.5rulesandRIPPER)isbetterintermsoffindinghighaccuracyrulesforthesmallclasses?

4. Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules,

R1: A → + (covers 4 positive and 1 negative examples),
R2: B → + (covers 30 positive and 10 negative examples),
R3: C → + (covers 100 positive and 90 negative examples),

determine which is the best and worst candidate rule according to:

a. Rule accuracy.

b. FOIL's information gain.

c. The likelihood ratio statistic.

d. The Laplace measure.

e. The m-estimate measure (with k = 2 and p+ = 0.2).

5. Figure 4.3 illustrates the coverage of the classification rules R1, R2, and R3. Determine which is the best and worst rule according to:

a. The likelihood ratio statistic.

b. The Laplace measure.

c. The m-estimate measure (with k = 2 and p+ = 0.58).

d. TheruleaccuracyafterR1hasbeendiscovered,wherenoneoftheexamplescoveredbyR1arediscarded.

e. TheruleaccuracyafterR1hasbeendiscovered,whereonlythepositiveexamplescoveredbyR1arediscarded.

f. TheruleaccuracyafterR1hasbeendiscovered,wherebothpositiveandnegativeexamplescoveredbyR1arediscarded.

6.

a. Supposethefractionofundergraduatestudentswhosmokeis15%andthefractionofgraduatestudentswhosmokeis23%.Ifone-fifthofthecollegestudentsaregraduatestudentsandtherestareundergraduates,whatistheprobabilitythatastudentwhosmokesisagraduatestudent?

b. Giventheinformationinpart(a),isarandomlychosencollegestudentmorelikelytobeagraduateorundergraduatestudent?

c. Repeatpart(b)assumingthatthestudentisasmoker.

d. Suppose30%ofthegraduatestudentsliveinadormbutonly10%oftheundergraduatestudentsliveinadorm.Ifastudentsmokesandlivesinthedorm,isheorshemorelikelytobeagraduateorundergraduatestudent?Youcanassumeindependencebetweenstudentswholiveinadormandthosewhosmoke.

7.ConsiderthedatasetshowninTable4.9

Table4.9.DatasetforExercise7.

Instance A B C Class

1 0 0 0


+

2 0 0 1

3 0 1 1

4 0 1 1

5 0 0 1

6 1 0 1

7 1 0 1

8 1 0 1

9 1 1 1

10 1 0 1

a. Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|−), P(B|−), and P(C|−).

b. Use the estimate of conditional probabilities given in the previous question to predict the class label for a test sample (A = 0, B = 1, C = 0) using the naïve Bayes approach.

c. Estimate the conditional probabilities using the m-estimate approach, with p = 1/2 and m = 4.

d. Repeat part (b) using the conditional probabilities given in part (c).

e. Compare the two methods for estimating probabilities. Which method is better and why?

8.ConsiderthedatasetshowninTable4.10 .

Table4.10.DatasetforExercise8.

+

+

+

+


Instance A B C Class

1 0 0 1

2 1 0 1

3 0 1 0

4 1 0 0

5 1 0 1

6 0 0 1

7 1 1 0

8 0 0 0

9 0 1 0

10 1 1 1 +

a. Estimate the conditional probabilities for P(A=1|+), P(B=1|+), P(C=1|+), P(A=1|−), P(B=1|−), and P(C=1|−) using the same approach as in the previous problem.

b. Use the conditional probabilities in part (a) to predict the class label for a test sample (A=1, B=1, C=1) using the naïve Bayes approach.

c. Compare P(A=1), P(B=1), and P(A=1, B=1). State the relationships between A and B.

d. Repeat the analysis in part (c) using P(A=1), P(B=0), and P(A=1, B=0).

e. Compare P(A=1, B=1|Class=+) against P(A=1|Class=+) and P(B=1|Class=+). Are the variables conditionally independent given the class?

+

+

+

+


9.

a. ExplainhownaïveBayesperformsonthedatasetshowninFigure4.56 .

b. Ifeachclassisfurtherdividedsuchthattherearefourclasses(A1,A2,B1,andB2),willnaïveBayesperformbetter?

c. Howwilladecisiontreeperformonthisdataset(forthetwo-classproblem)?Whatiftherearefourclasses?

10.Figure4.57 illustratestheBayesiannetworkforthedatasetshowninTable4.11 .(Assumethatalltheattributesarebinary).

a. Drawtheprobabilitytableforeachnodeinthenetwork.

b. Use the Bayesian network to compute P(Engine = Bad, Air Conditioner = Broken).

11.GiventheBayesiannetworkshowninFigure4.58 ,computethefollowingprobabilities:


Figure4.56.DatasetforExercise9.

Figure4.57.Bayesiannetwork.

a. P(B = good, F = empty, G = empty, S = yes).

b. P(B = bad, F = empty, G = not empty, S = no).

c. Giventhatthebatteryisbad,computetheprobabilitythatthecarwillstart.

12.Considertheone-dimensionaldatasetshowninTable4.12 .

a. Classify the data point x = 5.0 according to its 1-, 3-, 5-, and 9-nearest neighbors (using majority vote).

b. Repeatthepreviousanalysisusingthedistance-weightedvotingapproachdescribedinSection4.3.1 .

Table4.11.DatasetforExercise10.

Mileage | Engine | Air Conditioner | Number of Instances with Car Value = Hi | Number of Instances with Car Value = Lo

Hi Good Working 3 4

Hi Good Broken 1 2

Hi Bad Working 1 5

Hi Bad Broken 0 4

Lo Good Working 9 0

Lo Good Broken 5 1

Lo Bad Working 1 2

Lo Bad Broken 0 2




Figure4.58.BayesiannetworkforExercise11.

13. The nearest neighbor algorithm described in Section 4.3 can be extended to handle nominal attributes. A variant of the algorithm called PEBLS (Parallel Exemplar-Based Learning System) by Cost and Salzberg [219] measures the distance between two values of a nominal attribute using the modified value difference metric (MVDM). Given a pair of nominal attribute values, V1 and V2, the distance between them is defined as follows:

d(V1, V2) = ∑_{i=1}^{k} | n_{i1}/n_1 − n_{i2}/n_2 |,   (4.108)

where n_{ij} is the number of examples from class i with attribute value V_j and n_j is the number of examples with attribute value V_j.

Table 4.12. Data set for Exercise 12.

x | 0.5 | 3.0 | 4.5 | 4.6 | 4.9 | 5.2 | 5.3 | 5.5 | 7.0 | 9.5
y | − | − | + | + | + | − | − | + | − | −

Consider the training set for the loan classification problem shown in Figure 4.8. Use the MVDM measure to compute the distance between every pair of attribute values for the Home Owner and Marital Status attributes.
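The MVDM of Equation 4.108 can be computed directly from class-conditional counts. The sketch below is an illustration only; the counts in the usage example are hypothetical and are not taken from Figure 4.8.

```python
def mvdm(counts_v1, counts_v2):
    """Modified value difference metric between two nominal values.

    counts_v1[c] and counts_v2[c] are the numbers of examples of class c
    having attribute value V1 and V2, respectively (Equation 4.108).
    """
    n1 = sum(counts_v1.values())
    n2 = sum(counts_v2.values())
    classes = set(counts_v1) | set(counts_v2)
    return sum(abs(counts_v1.get(c, 0) / n1 - counts_v2.get(c, 0) / n2)
               for c in classes)

# Hypothetical counts: value V1 seen with 3 "Yes" and 1 "No" examples,
# value V2 with 1 "Yes" and 3 "No" examples.
print(mvdm({"Yes": 3, "No": 1}, {"Yes": 1, "No": 3}))  # 1.0
```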

14.ForeachoftheBooleanfunctionsgivenbelow,statewhethertheproblemislinearlyseparable.

a. AANDBANDC

b. NOTAANDB

c. (AORB)AND(AORC)

d. (AXORB)AND(AORB)

15.

a. DemonstratehowtheperceptronmodelcanbeusedtorepresenttheANDandORfunctionsbetweenapairofBooleanvariables.

b. Commentonthedisadvantageofusinglinearfunctionsasactivationfunctionsformulti-layerneuralnetworks.

16.Youareaskedtoevaluatetheperformanceoftwoclassificationmodels,and .Thetestsetyouhavechosencontains26binaryattributes,

labeledasAthroughZ.Table4.13 showstheposteriorprobabilitiesobtainedbyapplyingthemodelstothetestset.(Onlytheposteriorprobabilitiesforthepositiveclassareshown).Asthisisatwo-classproblem,

and .Assumethatwearemostlyinterestedindetectinginstancesfromthepositiveclass.

a. PlottheROCcurveforboth and .(Youshouldplotthemonthesamegraph.)Whichmodeldoyouthinkisbetter?Explainyourreasons.


M1 M2

P(−)=1−P(+) P(−|A,…,Z)=1−P(+|A,…,Z)

M1 M2

b. Formodel ,supposeyouchoosethecutoffthresholdtobe .Inotherwords,anytestinstanceswhoseposteriorprobabilityisgreaterthantwillbeclassifiedasapositiveexample.Computetheprecision,recall,andF-measureforthemodelatthisthresholdvalue.

c. Repeattheanalysisforpart(b)usingthesamecutoffthresholdonmodel.ComparetheF-measureresultsforbothmodels.Whichmodelis

better?AretheresultsconsistentwithwhatyouexpectfromtheROCcurve?

d. Repeatpart(b)formodel usingthethreshold .Whichthresholddoyouprefer, or ?AretheresultsconsistentwithwhatyouexpectfromtheROCcurve?

Table4.13.PosteriorprobabilitiesforExercise16.

Instance TrueClass

1 0.73 0.61

2 0.69 0.03

3 0.44 0.68

4 0.55 0.31

5 0.67 0.45

6 0.47 0.09

7 0.08 0.38

8 0.15 0.05

9 0.45 0.01

10 0.35 0.04

M1 t=0.5

M2

M1 t=0.1t=0.5 t=0.1

P(+|A,…,Z,M1) P(+|A,…,Z,M2)

+

+

+

+

+

17.Followingisadatasetthatcontainstwoattributes,XandY,andtwoclasslabels,“ ”and“ ”.Eachattributecantakethreedifferentvalues:0,1,or2.

X Y NumberofInstances

0 0 0 100

1 0 0 0

2 0 0 100

0 1 10 100

1 1 10 0

2 1 10 100

0 2 0 100

1 2 0 0

2 2 0 100

Theconceptforthe“ ”classis andtheconceptforthe“ ”classis.

a. Buildadecisiontreeonthedataset.Doesthetreecapturethe“ ”and“ ”concepts?

b. Whataretheaccuracy,precision,recall,and -measureofthedecisiontree?(Notethatprecision,recall,and -measurearedefinedwith

+ −

+ −

+ Y=1 −X=0∨X=2

+−

F1F1

respecttothe“ ”class.)

c. Buildanewdecisiontreewiththefollowingcostfunction:

(Hint:onlytheleavesoftheolddecisiontreeneedtobechanged.)Doesthedecisiontreecapturethe“ ”concept?

d. Whataretheaccuracy,precision,recall,and -measureofthenewdecisiontree?

18.Considerthetaskofbuildingaclassifierfromrandomdata,wheretheattributevaluesaregeneratedrandomlyirrespectiveoftheclasslabels.Assumethedatasetcontainsinstancesfromtwoclasses,“ ”and“ .”Halfofthedatasetisusedfortrainingwhiletheremaininghalfisusedfortesting.

a. Supposethereareanequalnumberofpositiveandnegativeinstancesinthedataandthedecisiontreeclassifierpredictseverytestinstancetobepositive.Whatistheexpectederrorrateoftheclassifieronthetestdata?

b. Repeatthepreviousanalysisassumingthattheclassifierpredictseachtestinstancetobepositiveclasswithprobability0.8andnegativeclasswithprobability0.2.

c. Supposetwo-thirdsofthedatabelongtothepositiveclassandtheremainingone-thirdbelongtothenegativeclass.Whatistheexpectederrorofaclassifierthatpredictseverytestinstancetobepositive?

d. Repeatthepreviousanalysisassumingthattheclassifierpredictseachtestinstancetobepositiveclasswithprobability2/3andnegativeclasswithprobability1/3.

+

C(i,j)={0,ifi=j;1,ifi=+,j=−;Numberof−instancesNumberof+instancesifi=−,j=+;

+

F1

+ −

19. Derive the dual Lagrangian for the linear SVM with non-separable data where the objective function is

f(w) = ‖w‖²/2 + C (∑_{i=1}^{N} ξi)².

20. Consider the XOR problem where there are four training points:

(1, 1, −), (1, 0, +), (0, 1, +), (0, 0, −).

Transform the data into the following feature space:

φ = (1, √2 x1, √2 x2, √2 x1x2, x1², x2²).

Find the maximum margin linear decision boundary in the transformed space.

21. Given the data sets shown in Figure 4.59, explain how the decision tree, naïve Bayes, and k-nearest neighbor classifiers would perform on these data sets.

Figure4.59.DatasetforExercise21.

5AssociationAnalysis:BasicConceptsandAlgorithms

Manybusinessenterprisesaccumulatelargequantitiesofdatafromtheirday-to-dayoperations.Forexample,hugeamountsofcustomerpurchasedataarecollecteddailyatthecheckoutcountersofgrocerystores.Table5.1 givesanexampleofsuchdata,commonlyknownasmarketbaskettransactions.Eachrowinthistablecorrespondstoatransaction,whichcontainsauniqueidentifierlabeledTIDandasetofitemsboughtbyagivencustomer.Retailersareinterestedinanalyzingthedatatolearnaboutthepurchasingbehavioroftheircustomers.Suchvaluableinformationcanbeusedtosupportavarietyofbusiness-relatedapplicationssuchasmarketingpromotions,inventorymanagement,andcustomerrelationshipmanagement.

Table5.1.Anexampleofmarketbaskettransactions.

TID Items

1 {Bread,Milk}

2 {Bread,Diapers,Beer,Eggs}

3 {Milk,Diapers,Beer,Cola}

4 {Bread,Milk,Diapers,Beer}

5 {Bread,Milk,Diapers,Cola}

Thischapterpresentsamethodologyknownasassociationanalysis,whichisusefulfordiscoveringinterestingrelationshipshiddeninlargedatasets.Theuncoveredrelationshipscanberepresentedintheformofsetsofitemspresentinmanytransactions,whichareknownasfrequentitemsets,orassociationrules,thatrepresentrelationshipsbetweentwoitemsets.Forexample,thefollowingrulecanbeextractedfromthedatasetshowninTable5.1 :

Therulesuggestsarelationshipbetweenthesaleofdiapersandbeerbecausemanycustomerswhobuydiapersalsobuybeer.Retailerscanusethesetypesofrulestohelpthemidentifynewopportunitiesforcross-sellingtheirproductstothecustomers.

{Diapers}→{Beer}.

Besidesmarketbasketdata,associationanalysisisalsoapplicabletodatafromotherapplicationdomainssuchasbioinformatics,medicaldiagnosis,webmining,andscientificdataanalysis.IntheanalysisofEarthsciencedata,forexample,associationpatternsmayrevealinterestingconnectionsamongtheocean,land,andatmosphericprocesses.SuchinformationmayhelpEarthscientistsdevelopabetterunderstandingofhowthedifferentelementsoftheEarthsysteminteractwitheachother.Eventhoughthetechniquespresentedherearegenerallyapplicabletoawidervarietyofdatasets,forillustrativepurposes,ourdiscussionwillfocusmainlyonmarketbasketdata.

Therearetwokeyissuesthatneedtobeaddressedwhenapplyingassociationanalysistomarketbasketdata.First,discoveringpatternsfromalargetransactiondatasetcanbecomputationallyexpensive.Second,someofthediscoveredpatternsmaybespurious(happensimplybychance)andevenfornon-spuriouspatterns,somearemoreinterestingthanothers.Theremainderofthischapterisorganizedaroundthesetwoissues.Thefirstpartofthechapterisdevotedtoexplainingthebasicconceptsofassociationanalysisandthealgorithmsusedtoefficientlyminesuchpatterns.Thesecondpartofthechapterdealswiththeissueofevaluatingthediscoveredpatternsinordertohelppreventthegenerationofspuriousresultsandtorankthepatternsintermsofsomeinterestingnessmeasure.

5.1PreliminariesThissectionreviewsthebasicterminologyusedinassociationanalysisandpresentsaformaldescriptionofthetask.

BinaryRepresentation

MarketbasketdatacanberepresentedinabinaryformatasshowninTable5.2 ,whereeachrowcorrespondstoatransactionandeachcolumncorrespondstoanitem.Anitemcanbetreatedasabinaryvariablewhosevalueisoneiftheitemispresentinatransactionandzerootherwise.Becausethepresenceofaniteminatransactionisoftenconsideredmoreimportantthanitsabsence,anitemisanasymmetricbinaryvariable.Thisrepresentationisasimplisticviewofrealmarketbasketdatabecauseitignoresimportantaspectsofthedatasuchasthequantityofitemssoldorthepricepaidtopurchasethem.Methodsforhandlingsuchnon-binarydatawillbeexplainedinChapter6 .

Table5.2.Abinary0/1representationofmarketbasketdata.

TID Bread Milk Diapers Beer Eggs Cola

1 1 1 0 0 0 0

2 1 0 1 1 1 0

3 0 1 1 1 0 1

4 1 1 1 1 0 0

5 1 1 1 0 0 1
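A small sketch (not from the book) of how the transactions of Table 5.1 can be converted into the asymmetric binary representation of Table 5.2:

```python
transactions = {                       # market basket data of Table 5.1
    1: {"Bread", "Milk"},
    2: {"Bread", "Diapers", "Beer", "Eggs"},
    3: {"Milk", "Diapers", "Beer", "Cola"},
    4: {"Bread", "Milk", "Diapers", "Beer"},
    5: {"Bread", "Milk", "Diapers", "Cola"},
}
items = ["Bread", "Milk", "Diapers", "Beer", "Eggs", "Cola"]

# One row per transaction, one asymmetric binary column per item (Table 5.2).
binary = {tid: [1 if item in basket else 0 for item in items]
          for tid, basket in transactions.items()}
print(binary[2])   # [1, 0, 1, 1, 1, 0]
```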

ItemsetandSupportCount

Let I = {i1, i2, …, id} be the set of all items in a market basket data and T = {t1, t2, …, tN} be the set of all transactions. Each transaction ti contains a subset of items chosen from I. In association analysis, a collection of zero or more items is termed an itemset. If an itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.

A transaction tj is said to contain an itemset X if X is a subset of tj. For example, the second transaction shown in Table 5.2 contains the itemset {Bread, Diapers} but not {Bread, Milk}. An important property of an itemset is its support count, which refers to the number of transactions that contain a particular itemset. Mathematically, the support count, σ(X), for an itemset X can be stated as follows:

σ(X) = |{ti | X ⊆ ti, ti ∈ T}|,

where the symbol |·| denotes the number of elements in a set. In the data set shown in Table 5.2, the support count for {Beer, Diapers, Milk} is equal to two because there are only two transactions that contain all three items.

Often, the property of interest is the support, which is the fraction of transactions in which an itemset occurs:

s(X) = σ(X)/N.

An itemset X is called frequent if s(X) is greater than some user-defined threshold, minsup.

Association Rule

An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured in terms of its support and confidence. Support determines how often a rule is applicable to a given data set, while confidence determines how frequently items in Y appear in transactions that contain X. The formal definitions of these metrics are

Support, s(X → Y) = σ(X ∪ Y) / N;   (5.1)

Confidence, c(X → Y) = σ(X ∪ Y) / σ(X).   (5.2)

Example 5.1. Consider the rule {Milk, Diapers} → {Beer}. Because the support count for {Beer, Diapers, Milk} is 2 and the total number of transactions is 5, the rule's support is 2/5 = 0.4. The rule's confidence is obtained by dividing the support count for {Beer, Diapers, Milk} by the support count for {Milk, Diapers}. Since there are 3 transactions that contain milk and diapers, the confidence for this rule is 2/3 = 0.67.
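The support and confidence of Equations 5.1 and 5.2 amount to simple subset tests over the transactions. The sketch below is a minimal illustration (not code from this text) that reproduces the numbers of Example 5.1 for the data of Table 5.1.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset):
    """sigma(X): number of transactions containing every item of X."""
    return sum(itemset <= t for t in transactions)

def support(X, Y):
    """Equation 5.1: s(X -> Y) = sigma(X u Y) / N."""
    return support_count(X | Y) / len(transactions)

def confidence(X, Y):
    """Equation 5.2: c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(X | Y) / support_count(X)

X, Y = {"Milk", "Diapers"}, {"Beer"}
print(support(X, Y))      # 0.4
print(confidence(X, Y))   # 0.666...
```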

Why Use Support and Confidence?

Support is an important measure because a rule that has very low support might occur simply by chance. Also, from a business perspective a low support rule is unlikely to be interesting because it might not be profitable to promote items that customers seldom buy together (with the exception of the situation described in Section 5.8). For these reasons, we are interested in finding rules whose support is greater than some user-defined threshold. As

willbeshowninSection5.2.1 ,supportalsohasadesirablepropertythatcanbeexploitedfortheefficientdiscoveryofassociationrules.

Confidence,ontheotherhand,measuresthereliabilityoftheinferencemadebyarule.Foragivenrule ,thehighertheconfidence,themorelikelyitisforYtobepresentintransactionsthatcontainX.ConfidencealsoprovidesanestimateoftheconditionalprobabilityofYgivenX.

Associationanalysisresultsshouldbeinterpretedwithcaution.Theinferencemadebyanassociationruledoesnotnecessarilyimplycausality.Instead,itcansometimessuggestastrongco-occurrencerelationshipbetweenitemsintheantecedentandconsequentoftherule.Causality,ontheotherhand,requiresknowledgeaboutwhichattributesinthedatacapturecauseandeffect,andtypicallyinvolvesrelationshipsoccurringovertime(e.g.,greenhousegasemissionsleadtoglobalwarming).SeeSection5.7.1 foradditionaldiscussion.

Formulation of the Association Rule Mining Problem

The association rule mining problem can be formally stated as follows:

Definition 5.1. (Association Rule Discovery.) Given a set of transactions T, find all the rules X → Y having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.

A brute-force approach for mining association rules is to compute the support and confidence for every possible rule. This approach is prohibitively expensive because there are exponentially many rules that can be extracted from a data set. More specifically, assuming that neither the left nor the right-hand side of the rule is an empty set, the total number of possible rules, R, extracted from a data set that contains d items is

R = 3^d − 2^(d+1) + 1.   (5.3)

The proof for this equation is left as an exercise to the readers (see Exercise 5 on page 440). Even for the small data set shown in Table 5.1, this approach requires us to compute the support and confidence for 3^6 − 2^7 + 1 = 602 rules. More than 80% of the rules are discarded after applying minsup = 20% and minconf = 50%, thus wasting most of the computations. To avoid performing needless computations, it would be useful to prune the rules early without having to compute their support and confidence values.

An initial step toward improving the performance of association rule mining algorithms is to decouple the support and confidence requirements. From Equation 5.1, notice that the support of a rule X → Y is the same as the support of its corresponding itemset, X ∪ Y. For example, the following rules have identical support because they involve items from the same itemset, {Beer, Diapers, Milk}:

{Beer, Diapers} → {Milk},   {Beer, Milk} → {Diapers},   {Diapers, Milk} → {Beer},
{Beer} → {Diapers, Milk},   {Milk} → {Beer, Diapers},   {Diapers} → {Beer, Milk}.

Iftheitemsetisinfrequent,thenallsixcandidaterulescanbeprunedimmediatelywithoutourhavingtocomputetheirconfidencevalues.
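Equation 5.3 above can be checked by brute-force enumeration. The short sketch below (an illustration only) counts every rule X → Y with disjoint, non-empty X and Y over d items and compares the count with 3^d − 2^(d+1) + 1.

```python
from itertools import combinations

def count_rules(d):
    """Count rules X -> Y with X, Y non-empty and disjoint over d items."""
    items = range(d)
    total = 0
    for kx in range(1, d + 1):
        for X in combinations(items, kx):
            rest = [i for i in items if i not in X]
            # Y is any non-empty subset of the remaining items.
            total += 2 ** len(rest) - 1
    return total

d = 6
print(count_rules(d), 3**d - 2**(d + 1) + 1)   # 602 602
```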

Therefore,acommonstrategyadoptedbymanyassociationruleminingalgorithmsistodecomposetheproblemintotwomajorsubtasks:

1. FrequentItemsetGeneration,whoseobjectiveistofindalltheitemsetsthatsatisfytheminsupthreshold.

2. RuleGeneration,whoseobjectiveistoextractallthehighconfidencerulesfromthefrequentitemsetsfoundinthepreviousstep.Theserulesarecalledstrongrules.

Thecomputationalrequirementsforfrequentitemsetgenerationaregenerallymoreexpensivethanthoseofrulegeneration.EfficienttechniquesforgeneratingfrequentitemsetsandassociationrulesarediscussedinSections5.2 and5.3 ,respectively.

5.2 Frequent Itemset Generation

A lattice structure can be used to enumerate the list of all possible itemsets. Figure 5.1 shows an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in many practical applications, the search space of itemsets that needs to be explored is exponentially large.

Figure 5.1. An itemset lattice.

A brute-force approach for finding frequent itemsets is to determine the support count for every candidate itemset in the lattice structure. To do this, we need to compare each candidate against every transaction, an operation that is shown in Figure 5.2. If the candidate is contained in a transaction, its support count will be incremented. For example, the support for {Bread, Milk} is incremented three times because the itemset is contained in transactions 1, 4, and 5. Such an approach can be very expensive because it requires O(NMw) comparisons, where N is the number of transactions, M = 2^k − 1 is the number of candidate itemsets, and w is the maximum transaction width. Transaction width is the number of items present in a transaction.

Figure5.2.Countingthesupportofcandidateitemsets.

There are three main approaches for reducing the computational complexity of frequent itemset generation.

1. Reduce the number of candidate itemsets (M). The Apriori principle, described in the next Section, is an effective way to eliminate some of the candidate itemsets without counting their support values.

2. Reduce the number of comparisons. Instead of matching each candidate itemset against every transaction, we can reduce the number of comparisons by using more advanced data structures, either to store the candidate itemsets or to compress the data set. We will discuss these strategies in Sections 5.2.4 and 5.6, respectively.

3. Reduce the number of transactions (N). As the size of candidate itemsets increases, fewer transactions will be supported by the itemsets. For instance, since the width of the first transaction in Table 5.1 is 2, it would be advantageous to remove this transaction before searching for frequent itemsets of size 3 and larger. Algorithms that employ such a strategy are discussed in the Bibliographic Notes.

5.2.1 The Apriori Principle

This Section describes how the support measure can be used to reduce the number of candidate itemsets explored during frequent itemset generation. The use of support for pruning candidate itemsets is guided by the following principle.

Theorem 5.1 (Apriori Principle). If an itemset is frequent, then all of its subsets must also be frequent.

To illustrate the idea behind the Apriori principle, consider the itemset lattice shown in Figure 5.3. Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must also be frequent.

Figure 5.3. An illustration of the Apriori principle. If {c, d, e} is frequent, then all subsets of this itemset are frequent.

Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too. As illustrated in Figure 5.4, the entire subgraph containing the supersets of {a, b} can be pruned immediately once {a, b} is found to be infrequent. This strategy of trimming the exponential search space based on the support measure is known as support-based pruning. Such a pruning strategy is made possible by a key property of the support measure, namely, that the support for an itemset never exceeds the support for its subsets. This property is also known as the anti-monotone property of the support measure.

Figure 5.4. An illustration of support-based pruning. If {a, b} is infrequent, then all supersets of {a, b} are infrequent.

Definition 5.2. (Anti-monotone Property.) A measure f possesses the anti-monotone property if for every itemset X that is a proper subset of itemset Y, i.e., X ⊂ Y, we have f(Y) ≤ f(X).

More generally, a large number of measures (see Section 5.7.1) can be applied to itemsets to evaluate various properties of itemsets. As will be shown in the next Section, any measure that has the anti-monotone property can be incorporated directly into an itemset mining algorithm to effectively prune the exponential search space of candidate itemsets.

5.2.2 Frequent Itemset Generation in the Apriori Algorithm

Apriori is the first association rule mining algorithm that pioneered the use of support-based pruning to systematically control the exponential growth of candidate itemsets. Figure 5.5 provides a high-level illustration of the frequent itemset generation part of the Apriori algorithm for the transactions shown in Table 5.1. We assume that the support threshold is 60%, which is equivalent to a minimum support count equal to 3.

Figure 5.5. Illustration of frequent itemset generation using the Apriori algorithm.

Initially, every item is considered as a candidate 1-itemset. After counting their supports, the candidate itemsets {Cola} and {Eggs} are discarded because they appear in fewer than three transactions. In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent. Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is (4 choose 2) = 6. Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent after computing their support values. The remaining four candidates are frequent, and thus will be used to generate candidate 3-itemsets. Without support-based pruning, there are (6 choose 3) = 20 candidate 3-itemsets that can be formed using the six items given in this example. With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent. The only candidate that has this property is {Bread, Diapers, Milk}. However, even though the subsets of {Bread, Diapers, Milk} are frequent, the itemset itself is not.

The effectiveness of the Apriori pruning strategy can be shown by counting the number of candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will produce (6 choose 1) + (6 choose 2) + (6 choose 3) = 6 + 15 + 20 = 41 candidates. With the Apriori principle, this number decreases to (6 choose 1) + (4 choose 2) + 1 = 6 + 6 + 1 = 13 candidates, which represents a 68% reduction in the number of candidate itemsets even in this simple example.

The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown in Algorithm 5.1. Let C_k denote the set of candidate k-itemsets and F_k denote the set of frequent k-itemsets:

The algorithm initially makes a single pass over the data set to determine the support of each item. Upon completion of this step, the set of all frequent 1-itemsets, F_1, will be known (steps 1 and 2). Next, the algorithm will iteratively generate new candidate k-itemsets and prune unnecessary candidates that are guaranteed to be infrequent given the frequent (k−1)-itemsets found in the previous iteration (steps 5 and 6). Candidate generation and pruning is implemented using the functions candidate-gen and candidate-prune, which are described in Section 5.2.3.

To count the support of the candidates, the algorithm needs to make an additional pass over the data set (steps 7–12). The subset function is used to determine all the candidate itemsets in C_k that are contained in each transaction t. The implementation of this function is described in Section 5.2.4. After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than N × minsup (step 13). The algorithm terminates when there are no new frequent itemsets generated, i.e., F_k = ∅ (step 14).

The frequent itemset generation part of the Apriori algorithm has two important characteristics. First, it is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time, from frequent 1-itemsets to the maximum size of frequent itemsets. Second, it employs a generate-and-test strategy for finding frequent itemsets. At each iteration (level), new candidate itemsets are generated from the frequent itemsets found in the previous iteration. The support for each candidate is then counted and tested against the minsup threshold. The total number of iterations needed by the algorithm is k_max + 1, where k_max is the maximum size of the frequent itemsets.
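Because the body of Algorithm 5.1 did not survive extraction here, the following Python sketch (a simplified reconstruction of the level-wise, generate-and-test structure just described, not the book's exact pseudocode) may help; its candidate generation step uses the simple strategy of extending frequent itemsets with frequent items, one of the options discussed in Section 5.2.3.

from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup):
    """Level-wise frequent itemset generation (illustrative sketch)."""
    transactions = [frozenset(t) for t in transactions]
    N = len(transactions)
    min_count = minsup * N

    def support_count(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return counts

    # Steps 1-2: frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    counts = support_count({frozenset([i]) for i in items})
    frequent = [{c for c, n in counts.items() if n >= min_count}]

    k = 1
    while frequent[-1]:
        # Candidate generation: extend each frequent k-itemset with a frequent item.
        frequent_items = {i for s in frequent[0] for i in s}
        candidates = {f | frozenset([i])
                      for f in frequent[-1] for i in frequent_items if i not in f}
        # Candidate pruning: keep only candidates whose k-item subsets are all frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent[-1]
                             for s in combinations(c, k))}
        counts = support_count(candidates)
        frequent.append({c for c, n in counts.items() if n >= min_count})
        k += 1
    return [itemset for level in frequent for itemset in level]

A call such as apriori_frequent_itemsets(transactions, 0.6) returns every itemset that meets the 60% support threshold, one level of the lattice at a time.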

5.2.3 Candidate Generation and Pruning

The candidate-gen and candidate-prune functions shown in Steps 5 and 6 of Algorithm 5.1 generate candidate itemsets and prune unnecessary ones by performing the following two operations, respectively:

1. Candidate Generation. This operation generates new candidate k-itemsets based on the frequent (k−1)-itemsets found in the previous iteration.

2. Candidate Pruning. This operation eliminates some of the candidate k-itemsets using support-based pruning, i.e., by removing k-itemsets whose subsets are known to be infrequent in previous iterations. Note that this pruning is done without computing the actual support of these k-itemsets (which could have required comparing them against each transaction).

Algorithm 5.1 Frequent itemset generation of the Apriori algorithm.

Candidate Generation

In principle, there are many ways to generate candidate itemsets. An effective candidate generation procedure must be complete and non-redundant. A candidate generation procedure is said to be complete if it does not omit any frequent itemsets. To ensure completeness, the set of candidate itemsets must subsume the set of all frequent itemsets, i.e., ∀k: F_k ⊆ C_k. A candidate generation procedure is non-redundant if it does not generate the same candidate itemset more than once. For example, the candidate itemset {a, b, c, d} can be generated in many ways: by merging {a, b, c} with {d}, {b, d} with {a, c}, {c} with {a, b, d}, etc. Generation of duplicate candidates leads to wasted computations and thus should be avoided for efficiency reasons. Also, an effective candidate generation procedure should avoid generating too many unnecessary candidates. A candidate itemset is unnecessary if at least one of its subsets is infrequent, and thus, eliminated in the candidate pruning step.

Next, we will briefly describe several candidate generation procedures, including the one used by the candidate-gen function.

Brute-Force Method

The brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates whose subsets are infrequent (see Figure 5.6). The number of candidate itemsets generated at level k is equal to (d choose k), where d is the total number of items. Although candidate generation is rather trivial, candidate pruning becomes extremely expensive because a large number of itemsets must be examined.

Figure 5.6. A brute-force method for generating candidate 3-itemsets.

F_{k−1} × F_1 Method

An alternative method for candidate generation is to extend each frequent (k−1)-itemset with frequent items that are not part of the (k−1)-itemset. Figure 5.7 illustrates how a frequent 2-itemset such as {Beer, Diapers} can be augmented with a frequent item such as Bread to produce a candidate 3-itemset {Beer, Bread, Diapers}.

Figure 5.7. Generating and pruning candidate k-itemsets by merging a frequent (k−1)-itemset with a frequent item. Note that some of the candidates are unnecessary because their subsets are infrequent.

The procedure is complete because every frequent k-itemset is composed of a frequent (k−1)-itemset and a frequent 1-itemset. Therefore, all frequent k-itemsets are part of the candidate k-itemsets generated by this procedure. Figure 5.7 shows that the F_{k−1} × F_1 candidate generation method only produces four candidate 3-itemsets, instead of the (6 choose 3) = 20 itemsets produced by the brute-force method. The F_{k−1} × F_1 method generates a lower number of candidates because every candidate is guaranteed to contain at least one frequent (k−1)-itemset. While this procedure is a substantial improvement over the brute-force method, it can still produce a large number of unnecessary candidates, as the remaining subsets of a candidate itemset can still be infrequent.

Note that the approach discussed above does not prevent the same candidate itemset from being generated more than once. For instance, {Bread, Diapers, Milk} can be generated by merging {Bread, Diapers} with {Milk}, {Bread, Milk} with {Diapers}, or {Diapers, Milk} with {Bread}. One way to avoid generating duplicate candidates is by ensuring that the items in each frequent itemset are kept sorted in their lexicographic order. For example, itemsets such as {Bread, Diapers}, {Bread, Diapers, Milk}, and {Diapers, Milk} follow lexicographic order as the items within every itemset are arranged alphabetically. Each frequent (k−1)-itemset X is then extended with frequent items that are lexicographically larger than the items in X. For example, the itemset {Bread, Diapers} can be augmented with {Milk} because Milk is lexicographically larger than Bread and Diapers. However, we should not augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with {Diapers} because they violate the lexicographic ordering condition. Every candidate k-itemset is thus generated exactly once, by merging the lexicographically largest item with the remaining k−1 items in the itemset. If the F_{k−1} × F_1 method is used in conjunction with lexicographic ordering, then only two candidate 3-itemsets will be produced in the example illustrated in Figure 5.7. {Beer, Bread, Diapers} and {Beer, Bread, Milk} will not be generated because {Beer, Bread} is not a frequent 2-itemset.

F_{k−1} × F_{k−1} Method

This candidate generation procedure, which is used in the candidate-gen function of the Apriori algorithm, merges a pair of frequent (k−1)-itemsets only if their first k−2 items, arranged in lexicographic order, are identical. Let A = {a_1, a_2, …, a_{k−1}} and B = {b_1, b_2, …, b_{k−1}} be a pair of frequent (k−1)-itemsets, arranged lexicographically. A and B are merged if they satisfy the following condition:

a_i = b_i (for i = 1, 2, …, k−2).

Note that in this case, a_{k−1} ≠ b_{k−1} because A and B are two distinct itemsets. The candidate k-itemset generated by merging A and B consists of the first k−2 common items followed by a_{k−1} and b_{k−1} in lexicographic order. This candidate generation procedure is complete, because for every lexicographically ordered frequent k-itemset, there exist two lexicographically ordered frequent (k−1)-itemsets that have identical items in the first k−2 positions.

In Figure 5.8, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form a candidate 3-itemset {Bread, Diapers, Milk}. The algorithm does not have to merge {Beer, Diapers} with {Diapers, Milk} because the first item in both itemsets is different. Indeed, if {Beer, Diapers, Milk} is a viable candidate, it would have been obtained by merging {Beer, Diapers} with {Beer, Milk} instead. This example illustrates both the completeness of the candidate generation procedure and the advantages of using lexicographic ordering to prevent duplicate candidates. Also, if we order the frequent (k−1)-itemsets according to their lexicographic rank, itemsets with identical first k−2 items would take consecutive ranks. As a result, the F_{k−1} × F_{k−1} candidate generation method would consider merging a frequent itemset only with ones that occupy the next few ranks in the sorted list, thus saving some computations.

Figure 5.8. Generating and pruning candidate k-itemsets by merging pairs of frequent (k−1)-itemsets.

Figure 5.8 shows that the F_{k−1} × F_{k−1} candidate generation procedure results in only one candidate 3-itemset. This is a considerable reduction from the four candidate 3-itemsets generated by the F_{k−1} × F_1 method. This is because the F_{k−1} × F_{k−1} method ensures that every candidate k-itemset contains at least two frequent (k−1)-itemsets, thus greatly reducing the number of candidates that are generated in this step.

Note that there can be multiple ways of merging two frequent (k−1)-itemsets in the F_{k−1} × F_{k−1} procedure, one of which is merging if their first k−2 items are identical. An alternate approach could be to merge two frequent (k−1)-itemsets A and B if the last k−2 items of A are identical to the first k−2 items of B. For example, {Bread, Diapers} and {Diapers, Milk} could be merged using this approach to generate the candidate 3-itemset {Bread, Diapers, Milk}. As we will see later, this alternate F_{k−1} × F_{k−1} procedure is useful in generating sequential patterns, which will be discussed in Chapter 6.
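The following Python fragment (an illustrative sketch, not the book's candidate-gen pseudocode) shows the prefix-based F_{k−1} × F_{k−1} merge together with the candidate pruning step described next; itemsets are again represented as lexicographically sorted tuples.

from itertools import combinations

def candidate_gen_fk1_fk1(frequent_k_minus_1):
    """Merge pairs of frequent (k-1)-itemsets that share their first k-2 items."""
    candidates = set()
    itemsets = sorted(frequent_k_minus_1)        # lexicographic order
    for a, b in combinations(itemsets, 2):
        if a[:-1] == b[:-1]:                     # identical first k-2 items
            candidates.add(a[:-1] + tuple(sorted((a[-1], b[-1]))))
    return candidates

def candidate_prune(candidates, frequent_k_minus_1):
    """Keep a candidate only if all of its (k-1)-subsets are frequent."""
    return {c for c in candidates
            if all(sub in frequent_k_minus_1
                   for sub in combinations(c, len(c) - 1))}

f2 = {("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")}
c3 = candidate_gen_fk1_fk1(f2)                   # {('Bread', 'Diapers', 'Milk')}
print(candidate_prune(c3, f2))                   # all 2-subsets of the candidate are frequent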

Candidate Pruning

To illustrate the candidate pruning operation for a candidate k-itemset, X = {i_1, i_2, …, i_k}, consider its k proper subsets, X − {i_j} (∀ j = 1, 2, …, k). If any of them are infrequent, then X is immediately pruned by using the Apriori principle. Note that we don't need to explicitly ensure that all subsets of X of size less than k−1 are frequent (see Exercise 7). This approach greatly reduces the number of candidate itemsets considered during support counting. For the brute-force candidate generation method, candidate pruning requires checking only k subsets of size k−1 for each candidate k-itemset. However, since the F_{k−1} × F_1 candidate generation strategy ensures that at least one of the (k−1)-size subsets of every candidate k-itemset is frequent, we only need to check for the remaining k−1 subsets. Likewise, the F_{k−1} × F_{k−1} strategy requires examining only k−2 subsets of every candidate k-itemset, since two of its (k−1)-size subsets are already known to be frequent in the candidate generation step.

5.2.4 Support Counting

Support counting is the process of determining the frequency of occurrence for every candidate itemset that survives the candidate pruning step. Support counting is implemented in steps 6 through 11 of Algorithm 5.1. A brute-force approach for doing this is to compare each transaction against every candidate itemset (see Figure 5.2) and to update the support counts of candidates contained in a transaction. This approach is computationally expensive, especially when the numbers of transactions and candidate itemsets are large.

An alternative approach is to enumerate the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets. To illustrate, consider a transaction t that contains five items, {1, 2, 3, 5, 6}. There are (5 choose 3) = 10 itemsets of size 3 contained in this transaction. Some of the itemsets may correspond to the candidate 3-itemsets under investigation, in which case, their support counts are incremented. Other subsets of t that do not correspond to any candidates can be ignored.

Figure 5.9 shows a systematic way for enumerating the 3-itemsets contained in t. Assuming that each itemset keeps its items in increasing lexicographic order, an itemset can be enumerated by specifying the smallest item first, followed by the larger items. For instance, given t = {1, 2, 3, 5, 6}, all the 3-itemsets contained in t must begin with item 1, 2, or 3. It is not possible to construct a 3-itemset that begins with items 5 or 6 because there are only two items in t whose labels are greater than or equal to 5. The number of ways to specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix tree structure depicted in Figure 5.9. For instance, 1 represents a 3-itemset that begins with item 1, followed by two more items chosen from the set {2, 3, 5, 6}.

Figure 5.9. Enumerating subsets of three items from a transaction t.

After fixing the first item, the prefix tree structure at Level 2 represents the number of ways to select the second item. For example, 1 2 corresponds to itemsets that begin with the prefix (1 2) and are followed by the items 3, 5, or 6. Finally, the prefix tree structure at Level 3 represents the complete set of 3-itemsets contained in t. For example, the 3-itemsets that begin with prefix {1 2} are {1, 2, 3}, {1, 2, 5}, and {1, 2, 6}, while those that begin with prefix {2 3} are {2, 3, 5} and {2, 3, 6}.

The prefix tree structure shown in Figure 5.9 demonstrates how itemsets contained in a transaction can be systematically enumerated, i.e., by specifying their items one by one, from the leftmost item to the rightmost item. We still have to determine whether each enumerated 3-itemset corresponds to an existing candidate itemset. If it matches one of the candidates, then the support count of the corresponding candidate is incremented. In the next Section, we illustrate how this matching operation can be performed efficiently using a hash tree structure.
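The enumeration itself maps directly onto a library call; the following sketch (illustrative only) enumerates the 3-subsets of a transaction and updates the counts of any that are also candidates.

from itertools import combinations

def count_by_enumeration(transaction, candidate_counts, k=3):
    """Enumerate all k-subsets of a transaction and increment matching candidates."""
    for subset in combinations(sorted(transaction), k):
        if subset in candidate_counts:            # only enumerated subsets that are candidates matter
            candidate_counts[subset] += 1

# Illustrative example: candidate 3-itemsets stored as sorted tuples.
candidate_counts = {(1, 2, 3): 0, (1, 2, 5): 0, (3, 5, 6): 0, (4, 5, 8): 0}
count_by_enumeration({1, 2, 3, 5, 6}, candidate_counts)
print(candidate_counts)   # {(1, 2, 3): 1, (1, 2, 5): 1, (3, 5, 6): 1, (4, 5, 8): 0}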

Support Counting Using a Hash Tree*

In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored in a hash tree. During support counting, itemsets contained in each transaction are also hashed into their appropriate buckets. That way, instead of comparing each itemset in the transaction with every candidate itemset, it is matched only against candidate itemsets that belong to the same bucket, as shown in Figure 5.10.

Figure 5.10. Counting the support of itemsets using hash structure.

Figure 5.11 shows an example of a hash tree structure. Each internal node of the tree uses the following hash function, h(p) = (p − 1) mod 3, where mod refers to the modulo (remainder) operator, to determine which branch of the current node should be followed next. For example, items 1, 4, and 7 are hashed to the same branch (i.e., the leftmost branch) because they have the same remainder after dividing the number by 3. All candidate itemsets are stored at the leaf nodes of the hash tree. The hash tree shown in Figure 5.11 contains 15 candidate 3-itemsets, distributed across 9 leaf nodes.

Figure 5.11. Hashing a transaction at the root node of a hash tree.

Consider the transaction t = {1, 2, 3, 4, 5, 6}. To update the support counts of the candidate itemsets, the hash tree must be traversed in such a way that all the leaf nodes containing candidate 3-itemsets belonging to t must be visited at least once. Recall that the 3-itemsets contained in t must begin with items 1, 2, or 3, as indicated by the Level 1 prefix tree structure shown in Figure 5.9. Therefore, at the root node of the hash tree, the items 1, 2, and 3 of the transaction are hashed separately. Item 1 is hashed to the left child of the root node, item 2 is hashed to the middle child, and item 3 is hashed to the right child. At the next level of the tree, the transaction is hashed on the second item listed in the Level 2 tree structure shown in Figure 5.9. For example, after hashing on item 1 at the root node, items 2, 3, and 5 of the transaction are hashed. Based on the hash function, items 2 and 5 are hashed to the middle child, while item 3 is hashed to the right child, as shown in Figure 5.12. This process continues until the leaf nodes of the hash tree are reached. The candidate itemsets stored at the visited leaf nodes are compared against the transaction. If a candidate is a subset of the transaction, its support count is incremented. Note that not all the leaf nodes are visited while traversing the hash tree, which helps in reducing the computational cost. In this example, 5 out of the 9 leaf nodes are visited and 9 out of the 15 itemsets are compared against the transaction.

Figure 5.12. Subset operation on the leftmost subtree of the root of a candidate hash tree.

5.2.5 Computational Complexity

The computational complexity of the Apriori algorithm, which includes both its runtime and storage, can be affected by the following factors.

Support Threshold

Lowering the support threshold often results in more itemsets being declared as frequent. This has an adverse effect on the computational complexity of the algorithm because more candidate itemsets must be generated and counted at every level, as shown in Figure 5.13. The maximum size of frequent itemsets also tends to increase with lower support thresholds. This increases the total number of iterations to be performed by the Apriori algorithm, further increasing the computational cost.

Figure 5.13. Effect of support threshold on the number of candidate and frequent itemsets obtained from a benchmark data set.

Number of Items (Dimensionality)

As the number of items increases, more space will be needed to store the support counts of items. If the number of frequent items also grows with the dimensionality of the data, the runtime and storage requirements will increase because of the larger number of candidate itemsets generated by the algorithm.

Number of Transactions

Because the Apriori algorithm makes repeated passes over the transaction data set, its runtime increases with a larger number of transactions.

Average Transaction Width

For dense data sets, the average transaction width can be very large. This affects the complexity of the Apriori algorithm in two ways. First, the maximum size of frequent itemsets tends to increase as the average transaction width increases. As a result, more candidate itemsets must be examined during candidate generation and support counting, as illustrated in Figure 5.14. Second, as the transaction width increases, more itemsets are contained in the transaction. This will increase the number of hash tree traversals performed during support counting.

A detailed analysis of the time complexity for the Apriori algorithm is presented next.

Figure 5.14. Effect of average transaction width on the number of candidate and frequent itemsets obtained from a synthetic data set.

Generation of frequent 1-itemsets

For each transaction, we need to update the support count for every item present in the transaction. Assuming that w is the average transaction width, this operation requires O(Nw) time, where N is the total number of transactions.

Candidate generation

To generate candidate k-itemsets, pairs of frequent (k−1)-itemsets are merged to determine whether they have at least k−2 items in common. Each merging operation requires at most k−2 equality comparisons. Every merging step can produce at most one viable candidate k-itemset, while in the worst case, the algorithm must try to merge every pair of frequent (k−1)-itemsets found in the previous iteration. Therefore, the overall cost of merging frequent itemsets is

∑_{k=2}^{w} (k−2)|C_k| < Cost of merging < ∑_{k=2}^{w} (k−2)|F_{k−1}|^2,

where w is the maximum transaction width. A hash tree is also constructed during candidate generation to store the candidate itemsets. Because the maximum depth of the tree is k, the cost for populating the hash tree with candidate itemsets is O(∑_{k=2}^{w} k|C_k|). During candidate pruning, we need to verify that the k−2 subsets of every candidate k-itemset are frequent. Since the cost for looking up a candidate in a hash tree is O(k), the candidate pruning step requires O(∑_{k=2}^{w} k(k−2)|C_k|) time.

Support counting

Each transaction of width |t| produces (|t| choose k) itemsets of size k. This is also the effective number of hash tree traversals performed for each transaction. The cost for support counting is O(N ∑_k (w choose k) α_k), where w is the maximum transaction width and α_k is the cost for updating the support count of a candidate k-itemset in the hash tree.

5.3 Rule Generation

This Section describes how to extract association rules efficiently from a given frequent itemset. Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules, ignoring rules that have empty antecedents or consequents (∅ → Y or Y → ∅). An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the confidence threshold. Note that all such rules must have already met the support threshold because they are generated from a frequent itemset.

Example 5.2. Let X = {a, b, c} be a frequent itemset. There are six candidate association rules that can be generated from X: {a, b} → {c}, {a, c} → {b}, {b, c} → {a}, {a} → {b, c}, {b} → {a, c}, and {c} → {a, b}. As each of their support is identical to the support for X, all the rules satisfy the support threshold.

Computing the confidence of an association rule does not require additional scans of the transaction data set. Consider the rule {1, 2} → {3}, which is generated from the frequent itemset X = {1, 2, 3}. The confidence for this rule is σ({1, 2, 3})/σ({1, 2}). Because {1, 2, 3} is frequent, the anti-monotone property of support ensures that {1, 2} must be frequent, too. Since the support counts for both itemsets were already found during frequent itemset generation, there is no need to read the entire data set again.

5.3.1 Confidence-Based Pruning

Confidence does not show the anti-monotone property in the same way as the support measure. For example, the confidence for X → Y can be larger, smaller, or equal to the confidence for another rule X̃ → Ỹ, where X̃ ⊆ X and Ỹ ⊆ Y (see Exercise 3 on page 439). Nevertheless, if we compare rules generated from the same frequent itemset Y, the following theorem holds for the confidence measure.

Theorem 5.2. Let Y be an itemset and X a subset of Y. If a rule X → Y − X does not satisfy the confidence threshold, then any rule X̃ → Y − X̃, where X̃ is a subset of X, must not satisfy the confidence threshold as well.

To prove this theorem, consider the following two rules: X̃ → Y − X̃ and X → Y − X, where X̃ ⊂ X. The confidence of the rules are σ(Y)/σ(X̃) and σ(Y)/σ(X), respectively. Since X̃ is a subset of X, σ(X̃) ≥ σ(X). Therefore, the former rule cannot have a higher confidence than the latter rule.

5.3.2 Rule Generation in Apriori Algorithm

The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the rule consequent. Initially, all the high confidence rules that have only one item in the rule consequent are extracted. These rules are then used to generate new candidate rules. For example, if {acd} → {b} and {abd} → {c} are high confidence rules, then the candidate rule {ad} → {bc} is generated by merging the consequents of both rules. Figure 5.15 shows a lattice structure for the association rules generated from the frequent itemset {a, b, c, d}. If any node in the lattice has low confidence, then according to Theorem 5.2, the entire subgraph spanned by the node can be pruned immediately. Suppose the confidence for {bcd} → {a} is low. All the rules containing item a in their consequent, including {cd} → {ab}, {bd} → {ac}, {bc} → {ad}, and {d} → {abc}, can be discarded.

Figure 5.15. Pruning of association rules using the confidence measure.

A pseudocode for the rule generation step is shown in Algorithms 5.2 and 5.3. Note the similarity between the ap-genrules procedure given in Algorithm 5.3 and the frequent itemset generation procedure given in Algorithm 5.1. The only difference is that, in rule generation, we do not have to make additional passes over the data set to compute the confidence of the candidate rules. Instead, we determine the confidence of each rule by using the support counts computed during frequent itemset generation.

Algorithm 5.2 Rule generation of the Apriori algorithm.

Algorithm 5.3 Procedure ap-genrules(f_k, H_m).
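Since the bodies of Algorithms 5.2 and 5.3 did not survive extraction here, the following Python sketch (a simplified reconstruction of the level-wise idea, not the book's exact pseudocode) generates high-confidence rules from one frequent itemset using stored support counts and Theorem 5.2-style pruning of consequents.

from itertools import combinations

def rules_from_itemset(itemset, support_count, minconf):
    """Level-wise rule generation for a single frequent itemset (sketch)."""
    itemset = frozenset(itemset)
    rules = []
    # Level 1: consequents with a single item.
    consequents = [frozenset([i]) for i in itemset]
    m = 1
    while consequents and m < len(itemset):
        next_consequents = []
        for Y in consequents:
            X = itemset - Y
            conf = support_count[itemset] / support_count[X]
            if conf >= minconf:
                rules.append((X, Y, conf))
                next_consequents.append(Y)
            # else: by Theorem 5.2, no rule with a superset of Y as consequent can qualify.
        # Merge surviving consequents to build the next level (larger consequents).
        consequents = {a | b for a, b in combinations(next_consequents, 2)
                       if len(a | b) == m + 1}
        m += 1
    return rules

# Hypothetical support counts for {a, b, c} and its subsets.
sc = {frozenset("abc"): 3, frozenset("ab"): 4, frozenset("ac"): 3,
      frozenset("bc"): 3, frozenset("a"): 5, frozenset("b"): 4, frozenset("c"): 3}
print(rules_from_itemset("abc", sc, minconf=0.8))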

5.3.3 An Example: Congressional Voting Records

This Section demonstrates the results of applying association analysis to the voting records of members of the United States House of Representatives. The data is obtained from the 1984 Congressional Voting Records Database, which is available at the UCI machine learning data repository. Each transaction contains information about the party affiliation for a representative along with his or her voting record on 16 key issues. There are 435 transactions and 34 items in the data set. The set of items are listed in Table 5.3.

Table 5.3. List of binary attributes from the 1984 United States Congressional Voting Records. Source: The UCI machine learning repository.

1. Republican
2. Democrat
3. handicapped-infants = yes
4. handicapped-infants = no
5. water project cost sharing = yes
6. water project cost sharing = no
7. budget-resolution = yes
8. budget-resolution = no
9. physician fee freeze = yes
10. physician fee freeze = no
11. aid to El Salvador = yes
12. aid to El Salvador = no
13. religious groups in schools = yes
14. religious groups in schools = no
15. anti-satellite test ban = yes
16. anti-satellite test ban = no
17. aid to Nicaragua = yes
18. aid to Nicaragua = no
19. MX-missile = yes
20. MX-missile = no
21. immigration = yes
22. immigration = no
23. synfuel corporation cutback = yes
24. synfuel corporation cutback = no
25. education spending = yes
26. education spending = no
27. right-to-sue = yes
28. right-to-sue = no
29. crime = yes
30. crime = no
31. duty-free-exports = yes
32. duty-free-exports = no
33. export administration act = yes
34. export administration act = no

The Apriori algorithm is then applied to the data set with minsup = 30% and minconf = 90%. Some of the high confidence rules extracted by the algorithm are shown in Table 5.4. The first two rules suggest that most of the members who voted yes for aid to El Salvador and no for budget resolution and MX missile are Republicans; while those who voted no for aid to El Salvador and yes for budget resolution and MX missile are Democrats. These high confidence rules show the key issues that divide members from both political parties.

Table 5.4. Association rules extracted from the 1984 United States Congressional Voting Records.

Association Rule | Confidence
{budget resolution = no, MX-missile = no, aid to El Salvador = yes} → {Republican} | 91.0%
{budget resolution = yes, MX-missile = yes, aid to El Salvador = no} → {Democrat} | 97.5%
{crime = yes, right-to-sue = yes, physician fee freeze = yes} → {Republican} | 93.5%
{crime = no, right-to-sue = no, physician fee freeze = no} → {Democrat} | 100%

5.4 Compact Representation of Frequent Itemsets

In practice, the number of frequent itemsets produced from a transaction data set can be very large. It is useful to identify a small representative set of frequent itemsets from which all other frequent itemsets can be derived. Two such representations are presented in this Section in the form of maximal and closed frequent itemsets.

5.4.1 Maximal Frequent Itemsets

Definition 5.3. (Maximal Frequent Itemset.) A frequent itemset is maximal if none of its immediate supersets are frequent.

To illustrate this concept, consider the itemset lattice shown in Figure 5.16. The itemsets in the lattice are divided into two groups: those that are frequent and those that are infrequent. A frequent itemset border, which is represented by a dashed line, is also illustrated in the diagram. Every itemset located above the border is frequent, while those located below the border (the shaded nodes) are infrequent. Among the itemsets residing near the border, {a, d}, {a, c, e}, and {b, c, d, e} are maximal frequent itemsets because all of their immediate supersets are infrequent. For example, the itemset {a, d} is maximal frequent because all of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}, are infrequent. In contrast, {a, c} is non-maximal because one of its immediate supersets, {a, c, e}, is frequent.

Figure 5.16. Maximal frequent itemset.

Maximal frequent itemsets effectively provide a compact representation of frequent itemsets. In other words, they form the smallest set of itemsets from which all frequent itemsets can be derived. For example, every frequent itemset in Figure 5.16 is a subset of one of the three maximal frequent itemsets, {a, d}, {a, c, e}, and {b, c, d, e}. If an itemset is not a proper subset of any of the maximal frequent itemsets, then it is either infrequent (e.g., {a, d, e}) or maximal frequent itself (e.g., {b, c, d, e}). Hence, the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} provide a compact representation of the frequent itemsets shown in Figure 5.16. Enumerating all the subsets of maximal frequent itemsets generates the complete list of all frequent itemsets.
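As a quick illustration of the definition (not an efficient algorithm; efficient approaches are discussed in Section 5.5), the following sketch picks out the maximal itemsets from an already-computed collection of frequent itemsets.

def maximal_frequent(frequent_itemsets):
    """Return the frequent itemsets that have no frequent proper superset."""
    frequent_itemsets = [frozenset(f) for f in frequent_itemsets]
    return [f for f in frequent_itemsets
            if not any(f < g for g in frequent_itemsets)]   # f < g: proper subset

# Illustrative input: a few frequent itemsets from a lattice like Figure 5.16.
frequent = [{"a"}, {"d"}, {"a", "d"}, {"a", "c"}, {"c", "e"}, {"a", "c", "e"}]
print(maximal_frequent(frequent))   # -> the maximal itemsets {a, d} and {a, c, e}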

Maximal frequent itemsets provide a valuable representation for data sets that can produce very long, frequent itemsets, as there are exponentially many frequent itemsets in such data. Nevertheless, this approach is practical only if an efficient algorithm exists to explicitly find the maximal frequent itemsets. We briefly describe one such approach in Section 5.5.

Despite providing a compact representation, maximal frequent itemsets do not contain the support information of their subsets. For example, the support of the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} does not provide any information about the support of their subsets except that it meets the support threshold. An additional pass over the data set is therefore needed to determine the support counts of the non-maximal frequent itemsets. In some cases, it is desirable to have a minimal representation of itemsets that preserves the support information. We describe such a representation in the next Section.

5.4.2 Closed Itemsets

Closed itemsets provide a minimal representation of all itemsets without losing their support information. A formal definition of a closed itemset is presented below.

Definition 5.4. (Closed Itemset.) An itemset X is closed if none of its immediate supersets has exactly the same support count as X.

Put another way, X is not closed if at least one of its immediate supersets has the same support count as X. Examples of closed itemsets are shown in Figure 5.17. To better illustrate the support count of each itemset, we have associated each node (itemset) in the lattice with a list of its corresponding transaction IDs. For example, since the node {b, c} is associated with transaction IDs 1, 2, and 3, its support count is equal to three. From the transactions given in this diagram, notice that the support for {b} is identical to {b, c}. This is because every transaction that contains b also contains c. Hence, {b} is not a closed itemset. Similarly, since c occurs in every transaction that contains both a and d, the itemset {a, d} is not closed as it has the same support as its superset {a, c, d}. On the other hand, {b, c} is a closed itemset because it does not have the same support count as any of its supersets.

Figure 5.17. An example of the closed frequent itemsets (with minimum support equal to 40%).

An interesting property of closed itemsets is that if we know their support counts, we can derive the support count of every other itemset in the itemset lattice without making additional passes over the data set. For example, consider the 2-itemset {b, e} in Figure 5.17. Since {b, e} is not closed, its support must be equal to the support of one of its immediate supersets, {a, b, e}, {b, c, e}, and {b, d, e}. Further, none of the supersets of {b, e} can have a support greater than the support of {b, e}, due to the anti-monotone nature of the support measure. Hence, the support of {b, e} can be computed by examining the support counts of all of its immediate supersets of size three and taking their maximum value. If an immediate superset is closed (e.g., {b, c, e}), we would know its support count. Otherwise, we can recursively compute its support by examining the supports of its immediate supersets of size four. In general, the support count of any non-closed (k−1)-itemset can be determined as long as we know the support counts of all k-itemsets. Hence, one can devise an iterative algorithm that computes the support counts of itemsets at level k−1 using the support counts of itemsets at level k, starting from the level k_max, where k_max is the size of the largest closed itemset.

Even though closed itemsets provide a compact representation of the support counts of all itemsets, they can still be exponentially large in number. Moreover, for most practical applications, we only need to determine the support count of all frequent itemsets. In this regard, closed frequent itemsets provide a compact representation of the support counts of all frequent itemsets, which can be defined as follows.

Definition 5.5. (Closed Frequent Itemset.) An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.

In the previous example, assuming that the support threshold is 40%, {b, c} is a closed frequent itemset because its support is 60%. In Figure 5.17, the closed frequent itemsets are indicated by the shaded nodes.

Algorithms are available to explicitly extract closed frequent itemsets from a given data set. Interested readers may refer to the Bibliographic Notes at the end of this chapter for further discussions of these algorithms. We can use closed frequent itemsets to determine the support counts for all non-closed frequent itemsets. For example, consider the frequent itemset {a, d} shown in Figure 5.17. Because this itemset is not closed, its support count must be equal to the maximum support count of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}. Also, since {a, d} is frequent, we only need to consider the support of its frequent supersets. In general, the support count of every non-closed frequent k-itemset can be obtained by considering the support of all its frequent supersets of size k+1. For example, since the only frequent superset of {a, d} is {a, c, d}, its support is equal to the support of {a, c, d}, which is 2. Using this methodology, an algorithm can be developed to compute the support for every frequent itemset. The pseudocode for this algorithm is shown in Algorithm 5.4. The algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest frequent itemsets. This is because, in order to find the support for a non-closed frequent itemset, the support for all of its supersets must be known. Note that the set of all frequent itemsets can be easily computed by taking the union of all subsets of frequent closed itemsets.

Algorithm 5.4 Support counting using closed frequent itemsets.
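The specific-to-general idea of Algorithm 5.4 can be sketched as follows (an illustrative Python reconstruction under the assumptions above, not the book's pseudocode): given the support counts of the closed frequent itemsets, the support of every other frequent itemset is filled in level by level, from the largest itemsets down to the smallest.

def supports_from_closed(closed_support, all_frequent):
    """Derive support counts of all frequent itemsets from closed frequent itemsets."""
    closed_support = {frozenset(k): v for k, v in closed_support.items()}
    support = dict(closed_support)
    # Process from largest to smallest itemsets (specific-to-general).
    for itemset in sorted((frozenset(f) for f in all_frequent), key=len, reverse=True):
        if itemset in support:
            continue                       # closed itemsets already have their counts
        # A non-closed frequent itemset inherits the maximum support of its frequent
        # supersets that are exactly one item larger.
        support[itemset] = max(v for s, v in support.items()
                               if len(s) == len(itemset) + 1 and itemset < s)
    return support

# Illustrative example, loosely in the spirit of Figure 5.17 (counts are hypothetical):
closed = {frozenset("acd"): 2, frozenset("ac"): 3, frozenset("a"): 4}
frequent = ["a", "ac", "ad", "acd"]
print(supports_from_closed(closed, frequent)[frozenset("ad")])   # 2, inherited from {a, c, d}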

To illustrate the advantage of using closed frequent itemsets, consider the data set shown in Table 5.5, which contains ten transactions and fifteen items. The items can be divided into three groups: (1) Group A, which contains items a_1 through a_5; (2) Group B, which contains items b_1 through b_5; and (3) Group C, which contains items c_1 through c_5. Assuming that the support threshold is 20%, itemsets involving items from the same group are frequent, but itemsets involving items from different groups are infrequent. The total number of frequent itemsets is thus 3 × (2^5 − 1) = 93. However, there are only four closed frequent itemsets in the data: {a_3, a_4}, {a_1, a_2, a_3, a_4, a_5}, {b_1, b_2, b_3, b_4, b_5}, and {c_1, c_2, c_3, c_4, c_5}. It is often sufficient to present only the closed frequent itemsets to the analysts instead of the entire set of frequent itemsets.

Table 5.5. A transaction data set for mining closed itemsets.

TID | a1 a2 a3 a4 a5 | b1 b2 b3 b4 b5 | c1 c2 c3 c4 c5
1   | 1  1  1  1  1  | 0  0  0  0  0  | 0  0  0  0  0
2   | 1  1  1  1  1  | 0  0  0  0  0  | 0  0  0  0  0
3   | 1  1  1  1  1  | 0  0  0  0  0  | 0  0  0  0  0
4   | 0  0  1  1  0  | 1  1  1  1  1  | 0  0  0  0  0
5   | 0  0  0  0  0  | 1  1  1  1  1  | 0  0  0  0  0
6   | 0  0  0  0  0  | 1  1  1  1  1  | 0  0  0  0  0
7   | 0  0  0  0  0  | 0  0  0  0  0  | 1  1  1  1  1
8   | 0  0  0  0  0  | 0  0  0  0  0  | 1  1  1  1  1
9   | 0  0  0  0  0  | 0  0  0  0  0  | 1  1  1  1  1
10  | 0  0  0  0  0  | 0  0  0  0  0  | 1  1  1  1  1

Finally, note that all maximal frequent itemsets are closed because none of the maximal frequent itemsets can have the same support count as their immediate supersets. The relationships among frequent, closed, closed frequent, and maximal frequent itemsets are shown in Figure 5.18.

Figure 5.18. Relationships among frequent, closed, closed frequent, and maximal frequent itemsets.

5.5 Alternative Methods for Generating Frequent Itemsets*

Apriori is one of the earliest algorithms to have successfully addressed the combinatorial explosion of frequent itemset generation. It achieves this by applying the Apriori principle to prune the exponential search space. Despite its significant performance improvement, the algorithm still incurs considerable I/O overhead since it requires making several passes over the transaction data set. In addition, as noted in Section 5.2.5, the performance of the Apriori algorithm may degrade significantly for dense data sets because of the increasing width of transactions. Several alternative methods have been developed to overcome these limitations and improve upon the efficiency of the Apriori algorithm. The following is a high-level description of these methods.

Traversal of Itemset Lattice

A search for frequent itemsets can be conceptually viewed as a traversal on the itemset lattice shown in Figure 5.1. The search strategy employed by an algorithm dictates how the lattice structure is traversed during the frequent itemset generation process. Some search strategies are better than others, depending on the configuration of frequent itemsets in the lattice. An overview of these strategies is presented next.

General-to-Specific versus Specific-to-General: The Apriori algorithm uses a general-to-specific search strategy, where pairs of frequent (k−1)-itemsets are merged to obtain candidate k-itemsets. This general-to-specific search strategy is effective, provided the maximum length of a frequent itemset is not too long. The configuration of frequent itemsets that works best with this strategy is shown in Figure 5.19(a), where the darker nodes represent infrequent itemsets. Alternatively, a specific-to-general search strategy looks for more specific frequent itemsets first, before finding the more general frequent itemsets. This strategy is useful to discover maximal frequent itemsets in dense transactions, where the frequent itemset border is located near the bottom of the lattice, as shown in Figure 5.19(b). The Apriori principle can be applied to prune all subsets of maximal frequent itemsets. Specifically, if a candidate k-itemset is maximal frequent, we do not have to examine any of its subsets of size k−1. However, if the candidate k-itemset is infrequent, we need to check all of its k−1 subsets in the next iteration. Another approach is to combine both general-to-specific and specific-to-general search strategies. This bidirectional approach requires more space to store the candidate itemsets, but it can help to rapidly identify the frequent itemset border, given the configuration shown in Figure 5.19(c).

Figure 5.19. General-to-specific, specific-to-general, and bidirectional search.

Equivalence Classes: Another way to envision the traversal is to first partition the lattice into disjoint groups of nodes (or equivalence classes). A frequent itemset generation algorithm searches for frequent itemsets within a particular equivalence class first before moving to another equivalence class. As an example, the level-wise strategy used in the Apriori algorithm can be considered to be partitioning the lattice on the basis of itemset sizes; i.e., the algorithm discovers all frequent 1-itemsets first before proceeding to larger-sized itemsets. Equivalence classes can also be defined according to the prefix or suffix labels of an itemset. In this case, two itemsets belong to the same equivalence class if they share a common prefix or suffix of length k. In the prefix-based approach, the algorithm can search for frequent itemsets starting with the prefix a before looking for those starting with prefixes b, c, and so on. Both prefix-based and suffix-based equivalence classes can be demonstrated using the tree-like structure shown in Figure 5.20.

Figure 5.20. Equivalence classes based on the prefix and suffix labels of itemsets.

Breadth-First versus Depth-First: The Apriori algorithm traverses the lattice in a breadth-first manner, as shown in Figure 5.21(a). It first discovers all the frequent 1-itemsets, followed by the frequent 2-itemsets, and so on, until no new frequent itemsets are generated. The itemset lattice can also be traversed in a depth-first manner, as shown in Figures 5.21(b) and 5.22. The algorithm can start from, say, node a in Figure 5.22, and count its support to determine whether it is frequent. If so, the algorithm progressively expands the next level of nodes, i.e., ab, abc, and so on, until an infrequent node is reached, say, abcd. It then backtracks to another branch, say, abce, and continues the search from there.

Figure 5.21. Breadth-first and depth-first traversals.

Figure 5.22. Generating candidate itemsets using the depth-first approach.

The depth-first approach is often used by algorithms designed to find maximal frequent itemsets. This approach allows the frequent itemset border to be detected more quickly than using a breadth-first approach. Once a maximal frequent itemset is found, substantial pruning can be performed on its subsets. For example, if the node bcde shown in Figure 5.22 is maximal frequent, then the algorithm does not have to visit the subtrees rooted at bd, be, c, d, and e because they will not contain any maximal frequent itemsets. However, if abc is maximal frequent, only the nodes such as ac and bc are not maximal frequent (but the subtrees of ac and bc may still contain maximal frequent itemsets). The depth-first approach also allows a different kind of pruning based on the support of itemsets. For example, suppose the support for {a, b, c} is identical to the support for {a, b}. The subtrees rooted at abd and abe can be skipped because they are guaranteed not to have any maximal frequent itemsets. The proof of this is left as an exercise to the readers.

Representation of Transaction Data Set

There are many ways to represent a transaction data set. The choice of representation can affect the I/O costs incurred when computing the support of candidate itemsets. Figure 5.23 shows two different ways of representing market basket transactions. The representation on the left is called a horizontal data layout, which is adopted by many association rule mining algorithms, including Apriori. Another possibility is to store the list of transaction identifiers (TID-list) associated with each item. Such a representation is known as the vertical data layout. The support for each candidate itemset is obtained by intersecting the TID-lists of its subset items. The length of the TID-lists shrinks as we progress to larger sized itemsets. However, one problem with this approach is that the initial set of TID-lists might be too large to fit into main memory, thus requiring more sophisticated techniques to compress the TID-lists. We describe another effective approach to represent the data in the next Section.

Figure 5.23. Horizontal and vertical data format.
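To make the vertical layout concrete, here is a small sketch (illustrative only; the TID-lists are hypothetical) that computes the support count of an itemset by intersecting the TID-lists of its items.

# Hypothetical vertical data layout: each item maps to the set of transaction IDs containing it.
tid_lists = {
    "a": {1, 4, 5, 6, 7, 8, 9},
    "b": {1, 2, 5, 7, 8, 10},
    "c": {2, 3, 4, 8, 9},
}

def support_count(itemset, tid_lists):
    """Support of an itemset = size of the intersection of its items' TID-lists."""
    tids = None
    for item in itemset:
        tids = tid_lists[item] if tids is None else tids & tid_lists[item]
    return len(tids)

print(support_count({"a", "b"}, tid_lists))   # number of transactions containing both a and b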

5.6 FP-Growth Algorithm*

This Section presents an alternative algorithm called FP-growth that takes a radically different approach to discovering frequent itemsets. The algorithm does not subscribe to the generate-and-test paradigm of Apriori. Instead, it encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets directly from this structure. The details of this approach are presented next.

5.6.1 FP-Tree Representation

An FP-tree is a compressed representation of the input data. It is constructed by reading the data set one transaction at a time and mapping each transaction onto a path in the FP-tree. As different transactions can have several items in common, their paths might overlap. The more the paths overlap with one another, the more compression we can achieve using the FP-tree structure. If the size of the FP-tree is small enough to fit into main memory, this will allow us to extract frequent itemsets directly from the structure in memory instead of making repeated passes over the data stored on disk.

Figure 5.24 shows a data set that contains ten transactions and five items. The structures of the FP-tree after reading the first three transactions are also depicted in the diagram. Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path. Initially, the FP-tree contains only the root node represented by the null symbol. The FP-tree is subsequently extended in the following way:

Figure 5.24. Construction of an FP-tree.

1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted in decreasing support counts inside every transaction of the data set. For the data set shown in Figure 5.24, a is the most frequent item, followed by b, c, d, and e.

2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to encode the transaction. Every node along the path has a frequency count of 1.

3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c, and d. A path is then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along this path also has a frequency count equal to one. Although the first two transactions have an item in common, which is b, their paths are disjoint because the transactions do not share a common prefix.

4. The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the first transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps with the path for the first transaction, null → a → b. Because of their overlapping path, the frequency count for node a is incremented to two, while the frequency counts for the newly created nodes, c, d, and e, are equal to one.

5. This process continues until every transaction has been mapped onto one of the paths given in the FP-tree. The resulting FP-tree after reading all the transactions is shown at the bottom of Figure 5.24.

The size of an FP-tree is typically smaller than the size of the uncompressed data because many transactions in market basket data often share a few items in common. In the best-case scenario, where all the transactions have the same set of items, the FP-tree contains only a single branch of nodes. The worst-case scenario happens when every transaction has a unique set of items. As none of the transactions have any items in common, the size of the FP-tree is effectively the same as the size of the original data. However, the physical storage requirement for the FP-tree is higher because it requires additional space to store pointers between nodes and counters for each item.

The size of an FP-tree also depends on how the items are ordered. The notion of ordering items in decreasing order of support counts relies on the possibility that the high support items occur more frequently across all paths and hence must be used as the most commonly occurring prefixes. For example, if the ordering scheme in the preceding example is reversed, i.e., from lowest to highest support item, the resulting FP-tree is shown in Figure 5.25. The tree appears to be denser because the branching factor at the root node has increased from 2 to 5 and the number of nodes containing the high support items such as a and b has increased from 3 to 12. Nevertheless, ordering by decreasing support counts does not always lead to the smallest tree, especially when the high support items do not occur frequently together with the other items. For example, suppose we augment the data set given in Figure 5.24 with 100 transactions that contain {e}, 80 transactions that contain {d}, 60 transactions that contain {c}, and 40 transactions that contain {b}. Item e is now most frequent, followed by d, c, b, and a. With the augmented transactions, ordering by decreasing support counts will result in an FP-tree similar to Figure 5.25, while a scheme based on increasing support counts produces a smaller FP-tree similar to Figure 5.24(iv).

Figure 5.25. An FP-tree representation for the data set shown in Figure 5.24 with a different item ordering scheme.

An FP-tree also contains a list of pointers connecting nodes that have the same items. These pointers, represented as dashed lines in Figures 5.24 and 5.25, help to facilitate the rapid access of individual items in the tree. We explain how to use the FP-tree and its corresponding pointers for frequent itemset generation in the next Section.

5.6.2 Frequent Itemset Generation in FP-Growth Algorithm

FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree in a bottom-up fashion. Given the example tree shown in Figure 5.24, the algorithm looks for frequent itemsets ending in e first, followed by d, c, b, and finally, a. This bottom-up strategy for finding frequent itemsets ending with a particular item is equivalent to the suffix-based approach described in Section 5.5. Since every transaction is mapped onto a path in the FP-tree, we can derive the frequent itemsets ending with a particular item, say, e, by examining only the paths containing node e. These paths can be accessed rapidly using the pointers associated with node e. The extracted paths are shown in Figure 5.26(a). Similar paths for itemsets ending in d, c, b, and a are shown in Figures 5.26(b), (c), (d), and (e), respectively.

Figure 5.26. Decomposing the frequent itemset generation problem into multiple subproblems, where each subproblem involves finding frequent itemsets ending in e, d, c, b, and a.

FP-growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-conquer strategy to split the problem into smaller subproblems. For example, suppose we are interested in finding all frequent itemsets ending in e. To do this, we must first check whether the itemset {e} itself is frequent. If it is frequent, we consider the subproblem of finding frequent itemsets ending in de, followed by ce, be, and ae. In turn, each of these subproblems is further decomposed into smaller subproblems. By merging the solutions obtained from the subproblems, all the frequent itemsets ending in e can be found. Finally, the set of all frequent itemsets can be generated by merging the solutions to the subproblems of finding frequent itemsets ending in e, d, c, b, and a. This divide-and-conquer approach is the key strategy employed by the FP-growth algorithm.

For a more concrete example on how to solve the subproblems, consider the task of finding frequent itemsets ending with e.

1. The first step is to gather all the paths containing node e. These initial paths are called prefix paths and are shown in Figure 5.27(a).

Figure 5.27. Example of applying the FP-growth algorithm to find frequent itemsets ending in e.

2. From the prefix paths shown in Figure 5.27(a), the support count for e is obtained by adding the support counts associated with node e. Assuming that the minimum support count is 2, {e} is declared a frequent itemset because its support count is 3.

3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent itemsets ending in de, ce, be, and ae. Before solving these subproblems, it must first convert the prefix paths into a conditional FP-tree, which is structurally similar to an FP-tree, except it is used to find frequent itemsets ending with a particular suffix. A conditional FP-tree is obtained in the following way:

a. First, the support counts along the prefix paths must be updated because some of the counts include transactions that do not contain item e. For example, the rightmost path shown in Figure 5.27(a) includes a transaction {b, c} that does not contain item e. The counts along the prefix path must therefore be adjusted to 1 to reflect the actual number of transactions containing {b, c, e}.

b. The prefix paths are truncated by removing the nodes for e. These nodes can be removed because the support counts along the prefix paths have been updated to reflect only transactions that contain e, and the subproblems of finding frequent itemsets ending in de, ce, be, and ae no longer need information about node e.

c. After updating the support counts along the prefix paths, some of the items may no longer be frequent. For example, the node b appears only once and has a support count equal to 1, which means that there is only one transaction that contains both b and e. Item b can be safely ignored from subsequent analysis because all itemsets ending in be must be infrequent.

The conditional FP-tree for e is shown in Figure 5.27(b). The tree looks different than the original prefix paths because the frequency counts have been updated and the nodes b and e have been eliminated.

4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding frequent itemsets ending in de, ce, and ae. To find the frequent itemsets ending in de, the prefix paths for d are gathered from the conditional FP-tree for e (Figure 5.27(c)). By adding the frequency counts associated with node d, we obtain the support count for {d, e}. Since the support count is equal to 2, {d, e} is declared a frequent itemset. Next, the algorithm constructs the conditional FP-tree for de using the approach described in step 3. After updating the support counts and removing the infrequent item c, the conditional FP-tree for de is shown in Figure 5.27(d). Since the conditional FP-tree contains only one item, a, whose support is equal to minsup, the algorithm extracts the frequent itemset {a, d, e} and moves on to the next subproblem, which is to generate frequent itemsets ending in ce. After processing the prefix paths for c, {c, e} is found to be frequent. However, the conditional FP-tree for c will have no frequent items and thus will be eliminated. The algorithm proceeds to solve the next subproblem and finds {a, e} to be the only frequent itemset remaining.

This example illustrates the divide-and-conquer approach used in the FP-growth algorithm. At each recursive step, a conditional FP-tree is constructed by updating the frequency counts along the prefix paths and removing all infrequent items. Because the subproblems are disjoint, FP-growth will not generate any duplicate itemsets. In addition, the counts associated with the nodes allow the algorithm to perform support counting while generating the common suffix itemsets.

FP-growth is an interesting algorithm because it illustrates how a compact representation of the transaction data set helps to efficiently generate frequent itemsets. In addition, for certain transaction data sets, FP-growth outperforms the standard Apriori algorithm by several orders of magnitude. The run-time performance of FP-growth depends on the compaction factor of the data set. If the resulting conditional FP-trees are very bushy (in the worst case, a full prefix tree), then the performance of the algorithm degrades significantly because it has to generate a large number of subproblems and merge the results returned by each subproblem.

5.7 Evaluation of Association Patterns

Although the Apriori principle significantly reduces the exponential search space of candidate itemsets, association analysis algorithms still have the potential to generate a large number of patterns. For example, although the data set shown in Table 5.1 contains only six items, it can produce hundreds of association rules at particular support and confidence thresholds. As the size and dimensionality of real commercial databases can be very large, we can easily end up with thousands or even millions of patterns, many of which might not be interesting. Identifying the most interesting patterns from the multitude of all possible ones is not a trivial task because "one person's trash might be another person's treasure." It is therefore important to establish a set of well-accepted criteria for evaluating the quality of association patterns.

The first set of criteria can be established through a data-driven approach to define objective interestingness measures. These measures can be used to rank patterns (itemsets or rules) and thus provide a straightforward way of dealing with the enormous number of patterns that are found in a data set. Some of the measures can also provide statistical information, e.g., itemsets that involve a set of unrelated items or cover very few transactions are considered uninteresting because they may capture spurious relationships in the data and should be eliminated. Examples of objective interestingness measures include support, confidence, and correlation.

The second set of criteria can be established through subjective arguments. A pattern is considered subjectively uninteresting unless it reveals unexpected information about the data or provides useful knowledge that can lead to profitable actions. For example, the rule {Butter} → {Bread} may not be interesting, despite having high support and confidence values, because the relationship represented by the rule might seem rather obvious. On the other hand, the rule {Diapers} → {Beer} is interesting because the relationship is quite unexpected and may suggest a new cross-selling opportunity for retailers. Incorporating subjective knowledge into pattern evaluation is a difficult task because it requires a considerable amount of prior information from domain experts. Readers interested in subjective interestingness measures may refer to resources listed in the bibliography at the end of this chapter.

5.7.1 Objective Measures of Interestingness

An objective measure is a data-driven approach for evaluating the quality of association patterns. It is domain-independent and requires only that the user specifies a threshold for filtering low-quality patterns. An objective measure is usually computed based on the frequency counts tabulated in a contingency table. Table 5.6 shows an example of a contingency table for a pair of binary variables, A and B. We use the notation A̅ (B̅) to indicate that A (B) is absent from a transaction. Each entry f_ij in this 2 × 2 table denotes a frequency count. For example, f_11 is the number of times A and B appear together in the same transaction, while f_01 is the number of transactions that contain B but not A. The row sum f_1+ represents the support count for A, while the column sum f_+1 represents the support count for B. Finally, even though our discussion focuses mainly on asymmetric binary variables, note that contingency tables are also applicable to other attribute types such as symmetric binary, nominal, and ordinal variables.

Table 5.6. A 2-way contingency table for variables A and B.

     |  B    |  B̅    |
A    | f_11  | f_10  | f_1+
A̅    | f_01  | f_00  | f_0+
     | f_+1  | f_+0  | N

Limitations of the Support-Confidence Framework

The classical association rule mining formulation relies on the support and confidence measures to eliminate uninteresting patterns. The drawback of support, which is described more fully in Section 5.8, is that many potentially interesting patterns involving low support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example.

Example 5.3. Suppose we are interested in analyzing the relationship between people who drink tea and coffee. We may gather information about the beverage preferences among a group of people and summarize their responses into a contingency table such as the one shown in Table 5.7.

Table 5.7. Beverage preferences among a group of 1000 people.

     | Coffee | Coffee̅ |
Tea  |  150   |   50   |  200
Tea̅  |  650   |  150   |  800
     |  800   |  200   | 1000

The information given in this table can be used to evaluate the association rule {Tea} → {Coffee}. At first glance, it may appear that people who drink tea also tend to drink coffee because the rule's support (15%) and confidence (75%) values are reasonably high. This argument would have been acceptable except that the fraction of people who drink coffee, regardless of whether they drink tea, is 80%, while the fraction of tea drinkers who drink coffee is only 75%. Thus knowing that a person is a tea drinker actually decreases her probability of being a coffee drinker from 80% to 75%! The rule {Tea} → {Coffee} is therefore misleading despite its high confidence value.

Now consider a similar problem where we are interested in analyzing the relationship between people who drink tea and people who use honey in their beverage. Table 5.8 summarizes the information gathered over the same group of people about their preferences for drinking tea and using honey. If we evaluate the association rule {Tea} → {Honey} using this information, we will find that the confidence value of this rule is merely 50%, which might be easily rejected using a reasonable threshold on the confidence value, say 70%. One thus might consider that the preference of a person for drinking tea has no influence on her preference for using honey. However, the fraction of people who use honey, regardless of whether they drink tea, is only 12%. Hence, knowing that a person drinks tea significantly increases her probability of using honey from 12% to 50%. Further, the fraction of people who do not drink tea but use honey is only 2.5%! This suggests that there is definitely some information in the preference of a person for using honey given that she drinks tea. The rule {Tea} → {Honey} may therefore be falsely rejected if confidence is used as the evaluation measure.

Table5.8.Informationaboutpeoplewhodrinkteaandpeoplewhousehoneyintheirbeverage.

{Tea}→{Coffee}

{Tea}→{Coffee}

{Tea}→{Honey}

{Tea}→{Honey}

Honey

Tea 100 100 200

20 780 800

120 880 1000

Notethatifwetakethesupportofcoffeedrinkersintoaccount,wewouldnotbesurprisedtofindthatmanyofthepeoplewhodrinkteaalsodrinkcoffee,sincetheoverallnumberofcoffeedrinkersisquitelargebyitself.Whatismoresurprisingisthatthefractionofteadrinkerswhodrinkcoffeeisactuallylessthantheoverallfractionofpeoplewhodrinkcoffee,whichpointstoaninverserelationshipbetweenteadrinkersandcoffeedrinkers.Similarly,ifweaccountforthefactthatthesupportofusinghoneyisinherentlysmall,itiseasytounderstandthatthefractionofteadrinkerswhousehoneywillnaturallybesmall.Instead,whatisimportanttomeasureisthechangeinthefractionofhoneyusers,giventheinformationthattheydrinktea.
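The numbers behind this discussion can be checked with a few lines of code. The following sketch (illustrative Python, not part of the original text) computes the support and confidence of {Tea} → {Coffee} directly from the counts in Table 5.7 and compares the confidence against the overall fraction of coffee drinkers.

```python
# Counts from Table 5.7: rows = Tea / not Tea, columns = Coffee / not Coffee.
f11, f10 = 150, 50     # tea & coffee, tea & no coffee
f01, f00 = 650, 150    # no tea & coffee, no tea & no coffee
N = f11 + f10 + f01 + f00                 # 1000 people

support = f11 / N                         # s(Tea, Coffee) = 0.15
confidence = f11 / (f11 + f10)            # c(Tea -> Coffee) = 0.75
coffee_support = (f11 + f01) / N          # s(Coffee) = 0.80

print(support, confidence, coffee_support)
# 0.15 0.75 0.8  -> knowing someone drinks tea lowers P(coffee) from 0.80 to 0.75
```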

The limitations of the confidence measure are well-known and can be understood from a statistical perspective as follows. The support of a variable measures the probability of its occurrence, while the support s(A,B) of a pair of variables A and B measures the probability of the two variables occurring together. Hence, the joint probability P(A,B) can be written as

P(A,B) = s(A,B) = f11/N.

If we assume A and B are statistically independent, i.e., there is no relationship between the occurrences of A and B, then P(A,B) = P(A) × P(B). Hence, under the assumption of statistical independence between A and B, the support s_indep(A,B) of A and B can be written as

s_indep(A,B) = s(A) × s(B), or equivalently, s_indep(A,B) = (f1+/N) × (f+1/N).    (5.4)

If the support between two variables, s(A,B), is equal to s_indep(A,B), then A and B can be considered to be unrelated to each other. However, if s(A,B) is widely different from s_indep(A,B), then A and B are most likely dependent. Hence, any deviation of s(A,B) from s(A) × s(B) can be seen as an indication of a statistical relationship between A and B. Since the confidence measure only considers the deviance of s(A,B) from s(A) and not from s(A) × s(B), it fails to account for the support of the consequent, namely s(B). This results in the detection of spurious patterns (e.g., {Tea} → {Coffee}) and the rejection of truly interesting patterns (e.g., {Tea} → {Honey}), as illustrated in the previous example.

Various objective measures have been used to capture the deviance of s(A,B) from s_indep(A,B) that are not susceptible to the limitations of the confidence measure. Below, we provide a brief description of some of these measures and discuss some of their properties.

Interest Factor  The interest factor, which is also called the "lift," can be defined as follows:

I(A,B) = s(A,B) / (s(A) × s(B)) = N f11 / (f1+ f+1).    (5.5)

Notice that s(A) × s(B) = s_indep(A,B). Hence, the interest factor measures the ratio of the support of a pattern, s(A,B), against its baseline support s_indep(A,B) computed under the statistical independence assumption. Using Equations 5.5 and 5.4, we can interpret the measure as follows:

I(A,B) = 1, if A and B are independent;
I(A,B) > 1, if A and B are positively related;
I(A,B) < 1, if A and B are negatively related.    (5.6)

For the tea-coffee example shown in Table 5.7, I = 0.15/(0.2 × 0.8) = 0.9375, thus suggesting a slight negative relationship between tea drinkers and coffee drinkers. Also, for the tea-honey example shown in Table 5.8, I = 0.1/(0.12 × 0.2) = 4.1667, suggesting a strong positive relationship between people who drink tea and people who use honey in their beverage. We can thus see that the interest factor is able to detect meaningful patterns in the tea-coffee and tea-honey examples. Indeed, the interest factor has a number of statistical advantages over the confidence measure that make it a suitable measure for analyzing statistical independence between variables.
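As a quick check of Equation 5.5, the sketch below (illustrative Python, not from the text) computes the interest factor for the tea-coffee and tea-honey tables.

```python
def interest_factor(f11, f10, f01, f00):
    """Interest factor I(A,B) = N * f11 / (f1+ * f+1), as in Equation 5.5."""
    N = f11 + f10 + f01 + f00
    return N * f11 / ((f11 + f10) * (f11 + f01))

print(interest_factor(150, 50, 650, 150))   # tea-coffee: 0.9375
print(interest_factor(100, 100, 20, 780))   # tea-honey: about 4.1667
```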

Piatetsky-Shapiro (PS) Measure  Instead of computing the ratio between s(A,B) and s(A) × s(B), the PS measure considers the difference between s(A,B) and s(A) × s(B) as follows:

PS = s(A,B) − s(A) × s(B) = f11/N − (f1+ f+1)/N².    (5.7)

The PS value is 0 when A and B are mutually independent of each other. Otherwise, PS > 0 when there is a positive relationship between the two variables, and PS < 0 when there is a negative relationship.

Correlation Analysis  Correlation analysis is one of the most popular techniques for analyzing relationships between a pair of variables. For continuous variables, correlation is defined using Pearson's correlation coefficient (see Equation 2.10 on page 83). For binary variables, correlation can be measured using the ϕ-coefficient, which is defined as

ϕ = (f11 f00 − f01 f10) / √(f1+ f+1 f0+ f+0).    (5.8)

If we rearrange the terms in Equation 5.8, we can show that the ϕ-coefficient can be rewritten in terms of the support measures of A, B, and {A,B} as follows:

ϕ = (s(A,B) − s(A) × s(B)) / √(s(A) × (1 − s(A)) × s(B) × (1 − s(B))).    (5.9)

Note that the numerator in the above equation is identical to the PS measure. Hence, the ϕ-coefficient can be understood as a normalized version of the PS measure, where the value of the ϕ-coefficient ranges from −1 to +1. From a statistical viewpoint, the correlation captures the normalized difference between s(A,B) and s_indep(A,B). A correlation value of 0 means no relationship, while a value of +1 suggests a perfect positive relationship and a value of −1 suggests a perfect negative relationship. The correlation measure has a statistical meaning and hence is widely used to evaluate the strength of statistical independence among variables. For instance, the correlation between tea and coffee drinkers in Table 5.7 is −0.0625, which is slightly less than 0. On the other hand, the correlation between people who drink tea and people who use honey in Table 5.8 is 0.5847, suggesting a positive relationship.

IS Measure  IS is an alternative measure for capturing the relationship between s(A,B) and s(A) × s(B). The IS measure is defined as follows:

IS(A,B) = √(I(A,B) × s(A,B)) = s(A,B) / √(s(A) s(B)) = f11 / √(f1+ f+1).    (5.10)

Although the definition of IS looks quite similar to the interest factor, there are some interesting differences between them. Since IS is the geometric mean between the interest factor and the support of a pattern, IS is large when both the interest factor and support are large. Hence, if the interest factors of two patterns are identical, IS prefers the pattern with higher support. It is also possible to show that IS is mathematically equivalent to the cosine measure for binary variables (see Equation 2.6 on page 81). The value of IS thus varies from 0 to 1, where an IS value of 0 corresponds to no co-occurrence of the two variables, while an IS value of 1 denotes a perfect relationship, since they occur in exactly the same transactions. For the tea-coffee example shown in Table 5.7, the value of IS is equal to 0.375, while the value of IS for the tea-honey example in Table 5.8 is 0.6455. The IS measure thus gives a higher value for the {Tea} → {Honey} rule than the {Tea} → {Coffee} rule, which is consistent with our understanding of the two rules.

Alternative Objective Interestingness Measures  Note that all of the measures defined in the previous section use different techniques to capture the deviance between s(A,B) and s_indep(A,B) = s(A) × s(B). Some measures use the ratio between s(A,B) and s_indep(A,B), e.g., the interest factor and IS, while some other measures consider the difference between the two, e.g., the PS and the ϕ-coefficient. Some measures are bounded in a particular range, e.g., the IS and the ϕ-coefficient, while others are unbounded and do not have a defined maximum or minimum value, e.g., the interest factor. Because of such differences, these measures behave differently when applied to different types of patterns. Indeed, the measures defined above are not exhaustive and there exist many alternative measures for capturing different properties of relationships between pairs of binary variables. Table 5.9 provides the definitions for some of these measures in terms of the frequency counts of a 2×2 contingency table.

Table 5.9. Examples of objective measures for the itemset {A, B}.

  Measure (Symbol)           Definition
  Correlation (ϕ)            (N f11 − f1+ f+1) / √(f1+ f+1 f0+ f+0)
  Odds ratio (α)             (f11 f00) / (f10 f01)
  Kappa (κ)                  (N f11 + N f00 − f1+ f+1 − f0+ f+0) / (N² − f1+ f+1 − f0+ f+0)
  Interest (I)               (N f11) / (f1+ f+1)
  Cosine (IS)                f11 / √(f1+ f+1)
  Piatetsky-Shapiro (PS)     f11/N − (f1+ f+1)/N²
  Collective strength (S)    ((f11 + f00)/(f1+ f+1 + f0+ f+0)) × ((N − f1+ f+1 − f0+ f+0)/(N − f11 − f00))
  Jaccard (ζ)                f11 / (f1+ + f+1 − f11)
  All-confidence (h)         min[f11/f1+, f11/f+1]
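The formulas in Table 5.9 are straightforward to implement. The sketch below (illustrative Python with a hypothetical helper name, not part of the original text) expresses several of the measures as functions of the four frequency counts; it can be used to reproduce values quoted in this section, such as ϕ = −0.0625 for Table 5.7 and IS = 0.6455 for Table 5.8.

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    """Selected measures from Table 5.9 as functions of the 2x2 counts."""
    N = f11 + f10 + f01 + f00
    f1p, f0p = f11 + f10, f01 + f00      # row sums  f1+, f0+
    fp1, fp0 = f11 + f01, f10 + f00      # column sums f+1, f+0
    return {
        "phi":   (N * f11 - f1p * fp1) / sqrt(f1p * fp1 * f0p * fp0),
        "alpha": (f11 * f00) / (f10 * f01),                      # odds ratio
        "kappa": (N * f11 + N * f00 - f1p * fp1 - f0p * fp0)
                 / (N * N - f1p * fp1 - f0p * fp0),
        "I":     N * f11 / (f1p * fp1),                          # interest factor
        "IS":    f11 / sqrt(f1p * fp1),                          # cosine
        "PS":    f11 / N - f1p * fp1 / (N * N),
        "zeta":  f11 / (f1p + fp1 - f11),                        # Jaccard
        "h":     min(f11 / f1p, f11 / fp1),                      # all-confidence
    }

print(measures(150, 50, 650, 150)["phi"])   # about -0.0625 (tea-coffee)
print(measures(100, 100, 20, 780)["IS"])    # about 0.6455 (tea-honey)
```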

Consistency among Objective Measures  Given the wide variety of measures available, it is reasonable to question whether the measures can produce similar ordering results when applied to a set of association patterns. If the measures are consistent, then we can choose any one of them as our evaluation metric. Otherwise, it is important to understand what their differences are in order to determine which measure is more suitable for analyzing certain types of patterns.

Suppose the measures defined in Table 5.9 are applied to rank the ten contingency tables shown in Table 5.10. These contingency tables are chosen to illustrate the differences among the existing measures. The ordering produced by these measures is shown in Table 5.11 (with 1 as the most interesting and 10 as the least interesting table). Although some of the measures appear to be consistent with each other, others produce quite different ordering results. For example, the rankings given by the ϕ-coefficient agree mostly with those provided by κ and collective strength, but are quite different than the rankings produced by the interest factor. Furthermore, a contingency table such as E10 is ranked lowest according to the ϕ-coefficient, but highest according to the interest factor.

Table 5.10. Example of contingency tables.

  Example     f11     f10     f01     f00
  E1         8123      83     424    1370
  E2         8330       2     622    1046
  E3         3954    3080       5    2961
  E4         2886    1363    1320    4431
  E5         1500    2000     500    6000
  E6         4000    2000    1000    3000
  E7         9481     298     127      94
  E8         4000    2000    2000    2000
  E9         7450    2483       4      63
  E10          61    2483       4    7452

Table 5.11. Rankings of contingency tables using the measures given in Table 5.9.

          ϕ    α    κ    I    IS   PS   S    ζ    h
  E1      1    3    1    6    2    2    1    2    2
  E2      2    1    2    7    3    5    2    3    3
  E3      3    2    4    4    5    1    3    6    8
  E4      4    8    3    3    7    3    4    7    5
  E5      5    7    6    2    9    6    6    9    9
  E6      6    9    5    5    6    4    5    5    7
  E7      7    6    7    9    1    8    7    1    1
  E8      8   10    8    8    8    7    8    8    7
  E9      9    4    9   10    4    9    9    4    4
  E10    10    5   10    1   10   10   10   10   10
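A ranking experiment like the one in Table 5.11 can be reproduced by applying measure functions to each table and sorting. The snippet below is illustrative only and assumes the measures() helper from the sketch after Table 5.9 is in scope; it ranks the ten tables by the ϕ-coefficient and by the interest factor.

```python
# Contingency tables E1-E10 from Table 5.10, given as (f11, f10, f01, f00).
tables = {
    "E1": (8123, 83, 424, 1370),   "E2": (8330, 2, 622, 1046),
    "E3": (3954, 3080, 5, 2961),   "E4": (2886, 1363, 1320, 4431),
    "E5": (1500, 2000, 500, 6000), "E6": (4000, 2000, 1000, 3000),
    "E7": (9481, 298, 127, 94),    "E8": (4000, 2000, 2000, 2000),
    "E9": (7450, 2483, 4, 63),     "E10": (61, 2483, 4, 7452),
}

for m in ("phi", "I"):
    # assumes measures() from the earlier sketch is available
    order = sorted(tables, key=lambda e: measures(*tables[e])[m], reverse=True)
    print(m, order)   # E10 comes last under phi but first under I (cf. Table 5.11)
```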

Properties of Objective Measures  The results shown in Table 5.11 suggest that the measures greatly differ from each other and can provide conflicting information about the quality of a pattern. In fact, no measure is universally best for all applications. In the following, we describe some properties of the measures that play an important role in determining if they are suited for a certain application.

Inversion Property

Consider the binary vectors shown in Figure 5.28. The 0/1 value in each column vector indicates whether a transaction (row) contains a particular item (column). For example, the vector A indicates that the item appears in the first and last transactions, whereas the vector B indicates that the item is contained only in the fifth transaction. The vectors A¯ and B¯ are the inverted versions of A and B, i.e., their 1 values have been changed to 0 values (presence to absence) and vice versa. Applying this transformation to a binary vector is called inversion. If a measure is invariant under the inversion operation, then its value for the vector pair {A¯, B¯} should be identical to its value for {A, B}. The inversion property of a measure can be tested as follows.

Figure 5.28. Effect of the inversion operation. The vectors A¯ and B¯ are inversions of vectors A and B, respectively.

Definition 5.6. (Inversion Property.) An objective measure M is invariant under the inversion operation if its value remains the same when exchanging the frequency counts f11 with f00 and f10 with f01.

Measures that are invariant under the inversion operation include the correlation (ϕ-coefficient), odds ratio, κ, and collective strength. These measures are especially useful in scenarios where the presence (1's) of a variable is as important as its absence (0's). For example, if we compare two sets of answers to a series of true/false questions where 0's (true) and 1's (false) are equally meaningful, we should use a measure that gives equal importance to occurrences of 0–0's and 1–1's. For the vectors shown in Figure 5.28, the ϕ-coefficient is equal to −0.1667 regardless of whether we consider the pair {A, B} or the pair {A¯, B¯}. Similarly, the odds ratio for both pairs of vectors is equal to a constant value of 0. Note that even though the ϕ-coefficient and the odds ratio are invariant to inversion, they can still show different results, as will be shown later.

Measures that do not remain invariant under the inversion operation include the interest factor and the IS measure. For example, the IS value for the pair {A¯, B¯} in Figure 5.28 is 0.825, which reflects the fact that the 1's in A¯ and B¯ occur frequently together. However, the IS value of its inverted pair {A, B} is equal to 0, since A and B do not have any co-occurrence of 1's. For asymmetric binary variables, e.g., the occurrence of words in documents, this is indeed the desired behavior. A desired similarity measure between asymmetric variables should not be invariant to inversion, since for these variables it is more meaningful to capture relationships based on the presence of a variable rather than its absence. On the other hand, if we are dealing with symmetric binary variables where the relationships between 0's and 1's are equally meaningful, care should be taken to ensure that the chosen measure is invariant to inversion.

Although the values of the interest factor and IS change with the inversion operation, they can still be inconsistent with each other. To illustrate this, consider Table 5.12, which shows the contingency tables for two pairs of variables, {p, q} and {r, s}. Note that r and s are inverted transformations of p and q, respectively, where the roles of 0's and 1's have just been reversed. The interest factor for {p, q} is 1.02 and for {r, s} is 4.08, which means that the interest factor finds the inverted pair {r, s} more related than the original pair {p, q}. On the contrary, the IS value decreases upon inversion from 0.9346 for {p, q} to 0.286 for {r, s}, suggesting quite an opposite trend to that of the interest factor. Even though these measures conflict with each other for this example, they may be the right choice of measure in different applications.

Table 5.12. Contingency tables for the pairs {p, q} and {r, s}.

            p      p¯
  q        880     50     930
  q¯        50     20      70
           930     70    1000

            r      r¯
  s         20     50      70
  s¯        50    880     930
            70    930    1000
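The inconsistency between the interest factor and IS under inversion can be verified directly from Table 5.12. The sketch below is illustrative only and reuses the measures() helper from the sketch after Table 5.9.

```python
# Table 5.12 counts as (f11, f10, f01, f00), with p (resp. r) as the column
# variable and q (resp. s) as the row variable.
pq = (880, 50, 50, 20)     # original pair {p, q}
rs = (20, 50, 50, 880)     # inverted pair {r, s}: roles of 0's and 1's reversed

for name, t in (("{p,q}", pq), ("{r,s}", rs)):
    m = measures(*t)       # assumes measures() from the Table 5.9 sketch
    print(name, "I =", round(m["I"], 2), " IS =", round(m["IS"], 3))
# The interest factor rises from about 1.02 to 4.08 after inversion,
# while IS drops sharply, mirroring the opposite trends described above.
```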

Scaling Property

Table 5.13 shows two contingency tables for gender and the grades achieved by students enrolled in a particular course. These tables can be used to study the relationship between gender and performance in the course. The second contingency table has data from the same population but has twice as many males and three times as many females. The actual number of males or females can depend upon the samples available for study, but the relationship between gender and grade should not change just because of differences in sample sizes. Similarly, if the number of students with high and low grades is changed in a new study, a measure of association between gender and grades should remain unchanged. Hence, we need a measure that is invariant to scaling of rows or columns. The process of multiplying a row or column of a contingency table by a constant value is called a row or column scaling operation. A measure that is invariant to scaling does not change its value after any row or column scaling operation.

Table 5.13. The grade-gender example.
(a) Sample data of size 100.

           Male   Female
  High       30       20      50
  Low        40       10      50
             70       30     100

(b) Sample data of size 230.

           Male   Female
  High       60       60     120
  Low        80       30     110
            140       90     230

Definition 5.7. (Scaling Invariance Property.) Let T be a contingency table with frequency counts [f11; f10; f01; f00]. Let T′ be the transformed contingency table with scaled frequency counts [k1 k3 f11; k2 k3 f10; k1 k4 f01; k2 k4 f00], where k1, k2, k3, k4 are positive constants used to scale the two rows and the two columns of T. An objective measure M is invariant under the row/column scaling operation if M(T) = M(T′).

Note that the use of the term 'scaling' here should not be confused with the scaling operation for continuous variables introduced in Chapter 2 on page 23, where all the values of a variable were multiplied by a constant factor, instead of scaling a row or column of a contingency table.

Scaling of rows and columns in contingency tables occurs in multiple ways in different applications. For example, if we are measuring the effect of a particular medical procedure on two sets of subjects, healthy and diseased, the ratio of healthy and diseased subjects can vary widely across different studies involving different groups of participants. Further, the fraction of healthy and diseased subjects chosen for a controlled study can be quite different from the true fraction observed in the complete population. These differences can result in a row or column scaling in the contingency tables for different populations of subjects. In general, the frequencies of items in a contingency table closely depend on the sample of transactions used to generate the table. Any change in the sampling procedure may result in a row or column scaling transformation. A measure that is expected to be invariant to differences in the sampling procedure must not change with row or column scaling.

Of all the measures introduced in Table 5.9, only the odds ratio (α) is invariant to row and column scaling operations. For example, the value of the odds ratio for both the tables in Table 5.13 is equal to 0.375. All other measures, such as the ϕ-coefficient, κ, IS, interest factor, and collective strength (S), change their values when the rows and columns of the contingency table are rescaled. Indeed, the odds ratio is a preferred choice of measure in the medical domain, where it is important to find relationships that do not change with differences in the population sample chosen for a study.
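Scaling invariance of the odds ratio is easy to confirm on the grade-gender tables. The sketch below (illustrative Python, not from the text) computes the odds ratio and the ϕ-coefficient for both samples in Table 5.13.

```python
from math import sqrt

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

def phi(f11, f10, f01, f00):
    N = f11 + f10 + f01 + f00
    f1p, f0p, fp1, fp0 = f11 + f10, f01 + f00, f11 + f01, f10 + f00
    return (N * f11 - f1p * fp1) / sqrt(f1p * fp1 * f0p * fp0)

sample_a = (30, 20, 40, 10)   # Table 5.13(a): rows High/Low, columns Male/Female
sample_b = (60, 60, 80, 30)   # Table 5.13(b): the same data with rescaled columns

print(odds_ratio(*sample_a), odds_ratio(*sample_b))        # both 0.375
print(round(phi(*sample_a), 3), round(phi(*sample_b), 3))  # phi changes slightly
```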

Null Addition Property

Suppose we are interested in analyzing the relationship between a pair of words, such as data and mining, in a set of documents. If a collection of articles about ice fishing is added to the data set, should the association between data and mining be affected? This process of adding unrelated data (in this case, documents) to a given data set is known as the null addition operation.

Definition 5.8. (Null Addition Property.) An objective measure M is invariant under the null addition operation if it is not affected by increasing f00, while all other frequencies in the contingency table stay the same.

For applications such as document analysis or market basket analysis, we would like to use a measure that remains invariant under the null addition operation. Otherwise, the relationship between words can be made to change simply by adding enough documents that do not contain both words! Examples of measures that satisfy this property include the cosine (IS) and Jaccard (ζ) measures, while those that violate this property include the interest factor, PS, odds ratio, and the ϕ-coefficient.

To demonstrate the effect of null addition, consider the two contingency tables T1 and T2 shown in Table 5.14. Table T2 has been obtained from T1 by adding 1000 extra transactions with both A and B absent. This operation only affects the f00 entry of Table T2, which has increased from 100 to 1100, whereas all the other frequencies in the table (f11, f10, and f01) remain the same. Since IS is invariant to null addition, it gives a constant value of 0.875 to both the tables. However, the addition of 1000 extra transactions with occurrences of 0–0's changes the value of the interest factor from 0.972 for T1 (denoting a slightly negative correlation) to 1.944 for T2 (positive correlation). Similarly, the value of the odds ratio increases from 7 for T1 to 77 for T2. Hence, when the interest factor or odds ratio is used as the association measure, the relationship between variables changes with the addition of null transactions where both the variables are absent. In contrast, the IS measure is invariant to null addition, since it considers two variables to be related only if they frequently occur together. Indeed, the IS measure (cosine measure) is widely used to measure similarity among documents, which is expected to depend only on the joint occurrences (1's) of words in documents, but not their absences (0's).

Table 5.14. An example demonstrating the effect of null addition.
(a) Table T1.

            B      B¯
  A        700    100     800
  A¯       100    100     200
           800    200    1000

(b) Table T2.

            B      B¯
  A        700    100     800
  A¯       100   1100    1200
           800   1200    2000
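The behavior of the measures under null addition can be checked against the counts in Table 5.14. The sketch below (illustrative Python, not from the text) recomputes the cosine (IS), interest factor, and odds ratio before and after the 1000 null transactions are added to f00.

```python
def cosine(f11, f10, f01, f00):
    return f11 / ((f11 + f10) * (f11 + f01)) ** 0.5

def interest(f11, f10, f01, f00):
    N = f11 + f10 + f01 + f00
    return N * f11 / ((f11 + f10) * (f11 + f01))

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

t1 = (700, 100, 100, 100)    # Table 5.14(a)
t2 = (700, 100, 100, 1100)   # Table 5.14(b): 1000 null transactions added to f00

for f in (cosine, interest, odds_ratio):
    print(f.__name__, round(f(*t1), 3), round(f(*t2), 3))
# cosine stays at 0.875, the odds ratio jumps from 7 to 77, and the
# interest factor roughly doubles when the null transactions are added.
```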

Table 5.15 provides a summary of properties for the measures defined in Table 5.9. Even though this list of properties is not exhaustive, it can serve as a useful guide for selecting the right choice of measure for an application. Ideally, if we know the specific requirements of a certain application, we can ensure that the selected measure shows properties that adhere to those requirements. For example, if we are dealing with asymmetric variables, we would prefer to use a measure that is not invariant to null addition or inversion. On the other hand, if we require the measure to remain invariant to changes in the sample size, we would like to use a measure that does not change with scaling.

Table 5.15. Properties of symmetric measures.

  Symbol   Measure                Inversion   Null Addition   Scaling
  ϕ        ϕ-coefficient          Yes         No              No
  α        odds ratio             Yes         No              Yes
  κ        Cohen's                Yes         No              No
  I        Interest               No          No              No
  IS       Cosine                 No          Yes             No
  PS       Piatetsky-Shapiro's    Yes         No              No
  S        Collective strength    Yes         No              No
  ζ        Jaccard                No          Yes             No
  h        All-confidence         No          Yes             No
  s        Support                No          No              No

Asymmetric Interestingness Measures  Note that in the discussion so far, we have only considered measures that do not change their value when the order of the variables is reversed. More specifically, if M is a measure and A and B are two variables, then M(A,B) is equal to M(B,A) if the order of the variables does not matter. Such measures are called symmetric. On the other hand, measures that depend on the order of variables (M(A,B) ≠ M(B,A)) are called asymmetric measures. For example, the interest factor is a symmetric measure because its value is identical for the rules A → B and B → A. In contrast, confidence is an asymmetric measure since the confidence for A → B and B → A may not be the same. Note that the use of the term 'asymmetric' to describe a particular type of measure of relationship (one in which the order of the variables is important) should not be confused with the use of 'asymmetric' to describe a binary variable for which only 1's are important. Asymmetric measures are more suitable for analyzing association rules, since the items in a rule do have a specific order. Even though we only considered symmetric measures to discuss the different properties of association measures, the above discussion is also relevant for the asymmetric measures. See the Bibliographic Notes for more information about different kinds of asymmetric measures and their properties.

5.7.2 Measures beyond Pairs of Binary Variables

The measures shown in Table 5.9 are defined for pairs of binary variables (e.g., 2-itemsets or association rules). However, many of them, such as support and all-confidence, are also applicable to larger-sized itemsets. Other measures, such as the interest factor, IS, PS, and the Jaccard coefficient, can be extended to more than two variables using the frequency counts tabulated in a multidimensional contingency table. An example of a three-dimensional contingency table for a, b, and c is shown in Table 5.16. Each entry f_ijk in this table represents the number of transactions that contain a particular combination of items a, b, and c. For example, f101 is the number of transactions that contain a and c, but not b. On the other hand, a marginal frequency such as f1+1 is the number of transactions that contain a and c, irrespective of whether b is present in the transaction.

Table 5.16. Example of a three-dimensional contingency table.

              b       b¯
  c    a    f111    f101    f1+1
       a¯   f011    f001    f0+1
            f+11    f+01    f++1

              b       b¯
  c¯   a    f110    f100    f1+0
       a¯   f010    f000    f0+0
            f+10    f+00    f++0

Given a k-itemset {i1, i2, …, ik}, the condition for statistical independence can be stated as follows:

f_{i1 i2 … ik} = (f_{i1+…+} × f_{+i2…+} × … × f_{++…ik}) / N^(k−1).    (5.11)

With this definition, we can extend objective measures such as the interest factor and PS, which are based on deviations from statistical independence, to more than two variables:

I = (N^(k−1) × f_{i1 i2 … ik}) / (f_{i1+…+} × f_{+i2…+} × … × f_{++…ik})
PS = f_{i1 i2 … ik}/N − (f_{i1+…+} × f_{+i2…+} × … × f_{++…ik}) / N^k

Another approach is to define the objective measure as the maximum, minimum, or average value for the associations between pairs of items in a pattern. For example, given a k-itemset X = {i1, i2, …, ik}, we may define the ϕ-coefficient for X as the average ϕ-coefficient(ip, iq) between every pair of items (ip, iq) in X. However, because the measure considers only pairwise associations, it may not capture all the underlying relationships within a pattern. Also, care should be taken in using such alternate measures for more than two variables, since they may not always show the anti-monotone property in the same way as the support measure, making them unsuitable for mining patterns using the Apriori principle.

Analysis of multidimensional contingency tables is more complicated because of the presence of partial associations in the data. For example, some associations may appear or disappear when conditioned upon the value of certain variables. This problem is known as Simpson's paradox and is described in Section 5.7.3. More sophisticated statistical techniques are available to analyze such relationships, e.g., loglinear models, but these techniques are beyond the scope of this book.
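As an illustration of the extended interest factor based on Equation 5.11, the sketch below (illustrative Python; the helper name and the toy transactions are made up) computes the k-way interest factor of an itemset directly from a list of transactions.

```python
def interest_k(transactions, itemset):
    """k-way interest factor: N^(k-1) * f_{i1...ik} / (product of item counts)."""
    N = len(transactions)
    k = len(itemset)
    joint = sum(1 for t in transactions if set(itemset) <= set(t))
    denom = 1
    for item in itemset:
        denom *= sum(1 for t in transactions if item in t)
    return (N ** (k - 1)) * joint / denom

txns = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c"}, {"a"}, {"b"}, {"c"}]
print(round(interest_k(txns, ["a", "b", "c"]), 2))
# 1.69: the three items co-occur more often than independence would predict
```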

5.7.3 Simpson's Paradox

It is important to exercise caution when interpreting the association between variables because the observed relationship may be influenced by the presence of other confounding factors, i.e., hidden variables that are not included in the analysis. In some cases, the hidden variables may cause the observed relationship between a pair of variables to disappear or reverse its direction, a phenomenon that is known as Simpson's paradox. We illustrate the nature of this paradox with the following example.

Consider the relationship between the sale of high-definition televisions (HDTV) and exercise machines, as shown in Table 5.17. The rule {HDTV = Yes} → {Exercise machine = Yes} has a confidence of 99/180 = 55% and the rule {HDTV = No} → {Exercise machine = Yes} has a confidence of 54/120 = 45%. Together, these rules suggest that customers who buy high-definition televisions are more likely to buy exercise machines than those who do not buy high-definition televisions.

Table 5.17. A two-way contingency table between the sale of high-definition television and exercise machine.

                    Buy Exercise Machine
  Buy HDTV           Yes     No
  Yes                 99     81     180
  No                  54     66     120
                     153    147     300

However, a deeper analysis reveals that the sales of these items depend on whether the customer is a college student or a working adult. Table 5.18 summarizes the relationship between the sale of HDTVs and exercise machines among college students and working adults. Notice that the support counts given in the table for college students and working adults sum up to the frequencies shown in Table 5.17. Furthermore, there are more working adults than college students who buy these items.

Table 5.18. Example of a three-way contingency table.

                                      Buy Exercise Machine
  Customer Group     Buy HDTV          Yes     No     Total
  College Students   Yes                 1      9        10
                     No                  4     30        34
  Working Adults     Yes                98     72       170
                     No                 50     36        86

For college students:

c({HDTV = Yes} → {Exercise machine = Yes}) = 1/10 = 10%,
c({HDTV = No} → {Exercise machine = Yes}) = 4/34 = 11.8%,

while for working adults:

c({HDTV = Yes} → {Exercise machine = Yes}) = 98/170 = 57.7%,
c({HDTV = No} → {Exercise machine = Yes}) = 50/86 = 58.1%.

The rules suggest that, for each group, customers who do not buy high-definition televisions are more likely to buy exercise machines, which contradicts the previous conclusion when data from the two customer groups are pooled together. Even if alternative measures such as correlation, odds ratio, or interest are applied, we still find that the sale of HDTVs and exercise machines is positively related in the combined data but is negatively related in the stratified data (see Exercise 21 on page 449). The reversal in the direction of association is known as Simpson's paradox.

The paradox can be explained in the following way. First, notice that most customers who buy HDTVs are working adults. This is reflected in the high confidence of the rule {HDTV = Yes} → {Working Adult} (170/180 = 94.4%). Second, the high confidence of the rule {Exercise machine = Yes} → {Working Adult} (148/153 = 96.7%) suggests that most customers who buy exercise machines are also working adults. Since working adults form the largest fraction of customers for both HDTVs and exercise machines, they both look related and the rule {HDTV = Yes} → {Exercise machine = Yes} turns out to be stronger in the combined data than what it would have been if the data were stratified. Hence, customer group acts as a hidden variable that affects both the fraction of customers who buy HDTVs and those who buy exercise machines. If we factor out the effect of the hidden variable by stratifying the data, we see that the relationship between buying HDTVs and buying exercise machines is not direct, but shows up as an indirect consequence of the effect of the hidden variable.

Simpson's paradox can also be illustrated mathematically as follows. Suppose

a/b < c/d  and  p/q < r/s,

where a/b and p/q may represent the confidence of the rule A → B in two different strata, while c/d and r/s may represent the confidence of the rule A¯ → B in the two strata. When the data is pooled together, the confidence values of the rules in the combined data are (a+p)/(b+q) and (c+r)/(d+s), respectively. Simpson's paradox occurs when

(a+p)/(b+q) > (c+r)/(d+s),

thus leading to the wrong conclusion about the relationship between the variables. The lesson here is that proper stratification is needed to avoid generating spurious patterns resulting from Simpson's paradox. For example, market basket data from a major supermarket chain should be stratified according to store locations, while medical records from various patients should be stratified according to confounding factors such as age and gender.
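Simpson's paradox in the HDTV example can be reproduced numerically. The following sketch (illustrative Python, not from the text) compares the rule confidences within each stratum of Table 5.18 with the confidences obtained after pooling.

```python
# (successes, total) for the rule {HDTV=...} -> {Exercise machine = Yes},
# per customer group, from Table 5.18.
strata = {
    "college students": {"yes": (1, 10), "no": (4, 34)},
    "working adults":   {"yes": (98, 170), "no": (50, 86)},
}

pooled = {"yes": [0, 0], "no": [0, 0]}
for group, rows in strata.items():
    for hdtv, (succ, total) in rows.items():
        print(f"{group}, HDTV={hdtv}: confidence = {succ / total:.1%}")
        pooled[hdtv][0] += succ
        pooled[hdtv][1] += total

for hdtv, (succ, total) in pooled.items():
    print(f"pooled, HDTV={hdtv}: confidence = {succ / total:.1%}")
# Within each stratum HDTV=No has the higher confidence, yet in the pooled
# data HDTV=Yes wins (55% vs 45%): the direction of the association reverses.
```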

5.8 Effect of Skewed Support Distribution

The performances of many association analysis algorithms are influenced by properties of their input data. For example, the computational complexity of the Apriori algorithm depends on properties such as the number of items in the data, the average transaction width, and the support threshold used. This section examines another important property that has significant influence on the performance of association analysis algorithms as well as the quality of extracted patterns. More specifically, we focus on data sets with skewed support distributions, where most of the items have relatively low to moderate frequencies, but a small number of them have very high frequencies.

Figure 5.29. A transaction data set containing three items, p, q, and r, where p is a high support item and q and r are low support items.

Figure 5.29 shows an illustrative example of a data set that has a skewed support distribution of its items. While p has a high support of 83.3% in the data, q and r are low-support items with a support of 16.7%. Despite their low support, q and r always occur together in the limited number of transactions in which they appear and hence are strongly related. A pattern mining algorithm therefore should report {q, r} as interesting.

However, note that choosing the right support threshold for mining itemsets such as {q, r} can be quite tricky. If we set the threshold too high (e.g., 20%), then we may miss many interesting patterns involving low-support items such as {q, r}. Conversely, setting the support threshold too low can be detrimental to the pattern mining process for the following reasons. First, the computational and memory requirements of existing association analysis algorithms increase considerably with low support thresholds. Second, the number of extracted patterns also increases substantially with low support thresholds, which makes their analysis and interpretation difficult. In particular, we may extract many spurious patterns that relate a high-frequency item such as p to a low-frequency item such as q. Such patterns, which are called cross-support patterns, are likely to be spurious because the association between p and q is largely influenced by the frequent occurrence of p instead of the joint occurrence of p and q together. Because the support of {p, q} is quite close to the support of {q, r}, we may easily select {p, q} if we set the support threshold low enough to include {q, r}.

An example of a real data set that exhibits a skewed support distribution is shown in Figure 5.30. The data, taken from the PUMS (Public Use Microdata Sample) census data, contains 49,046 records and 2113 asymmetric binary variables. We shall treat the asymmetric binary variables as items and records as transactions. While more than 80% of the items have support less than 1%, a handful of them have support greater than 90%. To understand the effect of skewed support distribution on frequent itemset mining, we divide the items into three groups, G1, G2, and G3, according to their support levels, as shown in Table 5.19. We can see that more than 82% of the items belong to G1 and have a support less than 1%. In market basket analysis, such low support items may correspond to expensive products (such as jewelry) that are seldom bought by customers, but whose patterns are still interesting to retailers. Patterns involving such low-support items, though meaningful, can easily be rejected by a frequent pattern mining algorithm with a high support threshold. On the other hand, setting a low support threshold may result in the extraction of spurious patterns that relate a high-frequency item in G3 to a low-frequency item in G1. For example, at a support threshold equal to 0.05%, there are 18,847 frequent pairs involving items from G1 and G3. Out of these, 93% of them are cross-support patterns; i.e., the patterns contain items from both G1 and G3.

Figure 5.30. Support distribution of items in the census data set.

Table 5.19. Grouping the items in the census data set based on their support values.

  Group                G1        G2        G3
  Support             <1%     1%−90%     >90%
  Number of Items     1735      358        20

This example shows that a large number of weakly related cross-support patterns can be generated when the support threshold is sufficiently low. Note that finding interesting patterns in data sets with skewed support distributions is not just a challenge for the support measure; similar statements can be made about many other objective measures discussed in the previous sections. Before presenting a methodology for finding interesting patterns and pruning spurious ones, we formally define the concept of cross-support patterns.

Definition 5.9. (Cross-Support Pattern.) Let us define the support ratio, r(X), of an itemset X = {i1, i2, …, ik} as

r(X) = min[s(i1), s(i2), …, s(ik)] / max[s(i1), s(i2), …, s(ik)].    (5.12)

Given a user-specified threshold hc, an itemset X is a cross-support pattern if r(X) < hc.

Example 5.4. Suppose the support for milk is 70%, while the support for sugar is 10% and caviar is 0.04%. Given hc = 0.01, the frequent itemset {milk, sugar, caviar} is a cross-support pattern because its support ratio is

r = min[0.7, 0.1, 0.0004] / max[0.7, 0.1, 0.0004] = 0.0004/0.7 = 0.00058 < 0.01.
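Equation 5.12 translates directly into code. The sketch below (illustrative Python with a hypothetical helper name, not from the text) computes the support ratio and flags cross-support patterns for the milk-sugar-caviar example.

```python
def support_ratio(supports):
    """r(X) = min item support / max item support (Equation 5.12)."""
    return min(supports) / max(supports)

supports = [0.7, 0.1, 0.0004]          # milk, sugar, caviar
hc = 0.01
r = support_ratio(supports)
print(round(r, 5), r < hc)             # 0.00057 True -> cross-support pattern
```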

Existing measures such as support and confidence may not be sufficient to eliminate cross-support patterns. For example, if we assume hc = 0.3 for the data set presented in Figure 5.29, the itemsets {p, q}, {p, r}, and {p, q, r} are cross-support patterns because their support ratios, which are equal to 0.2, are less than the threshold hc. However, their supports are comparable to that of {q, r}, making it difficult to eliminate cross-support patterns without losing interesting ones using a support-based pruning strategy. Confidence pruning also does not help because the confidence of the rules extracted from cross-support patterns can be very high. For example, the confidence for {q} → {p} is 80% even though {p, q} is a cross-support pattern. The fact that a cross-support pattern can produce a high-confidence rule should not come as a surprise because one of its items (p) appears very frequently in the data. Therefore, p is expected to appear in many of the transactions that contain q. Meanwhile, the rule {q} → {r} also has high confidence even though {q, r} is not a cross-support pattern. This example demonstrates the difficulty of using the confidence measure to distinguish between rules extracted from cross-support patterns and interesting patterns involving strongly connected but low-support items.

Even though the rule {q} → {p} has very high confidence, notice that the rule {p} → {q} has very low confidence because most of the transactions that contain p do not contain q. In contrast, the rule {r} → {q}, which is derived from {q, r}, has very high confidence. This observation suggests that cross-support patterns can be detected by examining the lowest confidence rule that can be extracted from a given itemset. An approach for finding the rule with the lowest confidence given an itemset can be described as follows.

1. Recall the following anti-monotone property of confidence:

conf({i1, i2} → {i3, i4, …, ik}) ≤ conf({i1, i2, i3} → {i4, i5, …, ik}).

This property suggests that confidence never increases as we shift more items from the left- to the right-hand side of an association rule. Because of this property, the lowest confidence rule extracted from a frequent itemset contains only one item on its left-hand side. We denote the set of all rules with only one item on the left-hand side as R1.

2. Given a frequent itemset {i1, i2, …, ik}, the rule

{ij} → {i1, i2, …, ij−1, ij+1, …, ik}

has the lowest confidence in R1 if s(ij) = max[s(i1), s(i2), …, s(ik)]. This follows directly from the definition of confidence as the ratio between the rule's support and the support of the rule antecedent. Hence, the confidence of a rule will be lowest when the support of the antecedent is highest.

3. Summarizing the previous points, the lowest confidence attainable from a frequent itemset {i1, i2, …, ik} is

s({i1, i2, …, ik}) / max[s(i1), s(i2), …, s(ik)].

This expression is also known as the h-confidence or all-confidence measure. Because of the anti-monotone property of support, the numerator of the h-confidence measure is bounded by the minimum support of any item that appears in the frequent itemset. In other words, the h-confidence of an itemset X = {i1, i2, …, ik} must not exceed the following expression:

h-confidence(X) ≤ min[s(i1), s(i2), …, s(ik)] / max[s(i1), s(i2), …, s(ik)].

Note that the upper bound of h-confidence in the above equation is exactly the same as the support ratio (r) given in Equation 5.12. Because the support ratio for a cross-support pattern is always less than hc, the h-confidence of the pattern is also guaranteed to be less than hc. Therefore, cross-support patterns can be eliminated by ensuring that the h-confidence values for the patterns exceed hc. As a final note, the advantages of using h-confidence go beyond eliminating cross-support patterns. The measure is also anti-monotone, i.e.,

h-confidence({i1, i2, …, ik}) ≥ h-confidence({i1, i2, …, ik+1}),

and thus can be incorporated directly into the mining algorithm. Furthermore, h-confidence ensures that the items contained in an itemset are strongly associated with each other. For example, suppose the h-confidence of an itemset X is 80%. If one of the items in X is present in a transaction, there is at least an 80% chance that the rest of the items in X also belong to the same transaction. Such strongly associated patterns involving low-support items are called hyperclique patterns.

Definition 5.10. (Hyperclique Pattern.) An itemset X is a hyperclique pattern if h-confidence(X) > hc, where hc is a user-specified threshold.
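h-confidence is simple to compute once the itemset support and the individual item supports are known. The sketch below (illustrative Python; the transaction data is made up to mimic the situation in Figure 5.29) computes h-confidence from a list of transactions and applies the hyperclique test of Definition 5.10.

```python
def h_confidence(transactions, itemset):
    """h-confidence(X) = s(X) / max item support, the lowest attainable confidence."""
    N = len(transactions)
    s_itemset = sum(1 for t in transactions if set(itemset) <= t) / N
    item_supports = [sum(1 for t in transactions if i in t) / N for i in itemset]
    return s_itemset / max(item_supports)

# Toy data: p is frequent, while q and r are rare but always occur together.
txns = [{"p"}] * 8 + [{"p", "q", "r"}, {"q", "r"}]
print(h_confidence(txns, ["q", "r"]))            # 1.0 -> hyperclique for any hc < 1
print(round(h_confidence(txns, ["p", "q"]), 2))  # low -> cross-support, pruned
```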

5.9 Bibliographic Notes

The association rule mining task was first introduced by Agrawal et al. [324, 325] to discover interesting relationships among items in market basket transactions. Since its inception, extensive research has been conducted to address the various issues in association rule mining, from its fundamental concepts to its implementation and applications. Figure 5.31 shows a taxonomy of the various research directions in this area, which is generally known as association analysis. As much of the research focuses on finding patterns that appear significantly often in the data, the area is also known as frequent pattern mining. A detailed review of some of the research topics in this area can be found in [362] and in [319].

Figure5.31.Anoverviewofthevariousresearchdirectionsinassociationanalysis.

Conceptual Issues

Researchontheconceptualissuesofassociationanalysishasfocusedondevelopingatheoreticalformulationofassociationanalysisandextendingtheformulationtonewtypesofpatternsandgoingbeyondasymmetricbinaryattributes.

FollowingthepioneeringworkbyAgrawaletal.[324,325],therehasbeenavastamountofresearchondevelopingatheoreticalformulationfortheassociationanalysisproblem.In[357],Gunopoulosetal.showedtheconnectionbetweenfindingmaximalfrequentitemsetsandthehypergraphtransversalproblem.Anupperboundonthecomplexityoftheassociationanalysistaskwasalsoderived.Zakietal.[454,456]andPasquieretal.[407]haveappliedformalconceptanalysistostudythefrequentitemsetgenerationproblem.Moreimportantly,suchresearchhasledtothedevelopmentofaclassofpatternsknownasclosedfrequentitemsets[456].Friedmanetal.[355]havestudiedtheassociationanalysisprobleminthecontextofbumphuntinginmultidimensionalspace.Specifically,theyconsiderfrequentitemsetgenerationasthetaskoffindinghighdensityregionsinmultidimensionalspace.Formalizingassociationanalysisinastatisticallearningframeworkisanotheractiveresearchdirection[414,435,444]asitcanhelpaddressissuesrelatedtoidentifyingstatisticallysignificantpatternsanddealingwithuncertaindata[320,333,343].

Overtheyears,theassociationruleminingformulationhasbeenexpandedtoencompassotherrule-basedpatterns,suchas,profileassociationrules[321],cyclicassociationrules[403],fuzzyassociationrules[379],exceptionrules[431],negativeassociationrules[336,418],weightedassociationrules[338,413],dependencerules[422],peculiarrules[462],inter-transactionassociationrules[353,440],andpartialclassificationrules[327,397].Additionally,theconceptoffrequentitemsethasbeenextendedtoothertypesofpatternsincludingcloseditemsets[407,456],maximalitemsets[330],hypercliquepatterns[449],supportenvelopes[428],emergingpatterns[347],

contrastsets[329],high-utilityitemsets[340,390],approximateorerror-tolerantitem-sets[358,389,451],anddiscriminativepatterns[352,401,430].Associationanalysistechniqueshavealsobeensuccessfullyappliedtosequential[326,426],spatial[371],andgraph-based[374,380,406,450,455]data.

Substantialresearchhasbeenconductedtoextendtheoriginalassociationruleformulationtonominal[425],ordinal[392],interval[395],andratio[356,359,425,443,461]attributes.Oneofthekeyissuesishowtodefinethesupportmeasurefortheseattributes.AmethodologywasproposedbySteinbachetal.[429]toextendthetraditionalnotionofsupporttomoregeneralpatternsandattributetypes.

Implementation Issues

Research activities in this area revolve around (1) integrating the mining capability into existing database technology, (2) developing efficient and scalable mining algorithms, (3) handling user-specified or domain-specific constraints, and (4) post-processing the extracted patterns.

Thereareseveraladvantagestointegratingassociationanalysisintoexistingdatabasetechnology.First,itcanmakeuseoftheindexingandqueryprocessingcapabilitiesofthedatabasesystem.Second,itcanalsoexploittheDBMSsupportforscalability,check-pointing,andparallelization[415].TheSETMalgorithmdevelopedbyHoutsmaetal.[370]wasoneoftheearliestalgorithmstosupportassociationrulediscoveryviaSQLqueries.Sincethen,numerousmethodshavebeendevelopedtoprovidecapabilitiesforminingassociationrulesindatabasesystems.Forexample,theDMQL[363]andM-SQL[373]querylanguagesextendthebasicSQLwithnewoperatorsforminingassociationrules.TheMineRuleoperator[394]isanexpressiveSQLoperatorthatcanhandlebothclusteredattributesanditemhierarchies.Tsuret

al.[439]developedagenerate-and-testapproachcalledqueryflocksforminingassociationrules.AdistributedOLAP-basedinfrastructurewasdevelopedbyChenetal.[341]forminingmultilevelassociationrules.

Despiteitspopularity,theApriorialgorithmiscomputationallyexpensivebecauseitrequiresmakingmultiplepassesoverthetransactiondatabase.ItsruntimeandstoragecomplexitieswereinvestigatedbyDunkelandSoparkar[349].TheFP-growthalgorithmwasdevelopedbyHanetal.in[364].OtheralgorithmsforminingfrequentitemsetsincludetheDHP(dynamichashingandpruning)algorithmproposedbyParketal.[405]andthePartitionalgorithmdevelopedbySavasereetal[417].Asampling-basedfrequentitemsetgenerationalgorithmwasproposedbyToivonen[436].Thealgorithmrequiresonlyasinglepassoverthedata,butitcanproducemorecandidateitem-setsthannecessary.TheDynamicItemsetCounting(DIC)algorithm[337]makesonly1.5passesoverthedataandgenerateslesscandidateitemsetsthanthesampling-basedalgorithm.Othernotablealgorithmsincludethetree-projectionalgorithm[317]andH-Mine[408].Surveyarticlesonfrequentitemsetgenerationalgorithmscanbefoundin[322,367].ArepositoryofbenchmarkdatasetsandsoftwareimplementationofassociationruleminingalgorithmsisavailableattheFrequentItemsetMiningImplementations(FIMI)repository(http://fimi.cs.helsinki.fi).

Parallel algorithms have been developed to scale up association rule mining for handling big data [318, 360, 399, 420, 457]. A survey of such algorithms can be found in [453]. Online and incremental association rule mining algorithms have also been proposed by Hidber [365] and Cheung et al. [342]. More recently, new algorithms have been developed to speed up frequent itemset mining by exploiting the processing power of GPUs [459] and the MapReduce/Hadoop distributed computing framework [382, 384, 396]. For example, an implementation of frequent itemset mining for the Hadoop framework is available in the Apache Mahout software (http://mahout.apache.org).

Srikant et al. [427] have considered the problem of mining association rules in the presence of Boolean constraints such as the following:

(Cookies ∧ Milk) ∨ (descendants(Cookies) ∧ ¬ancestors(Wheat Bread))

Given such a constraint, the algorithm looks for rules that contain both cookies and milk, or rules that contain the descendant items of cookies but not ancestor items of wheat bread. Singh et al. [424] and Ng et al. [400] had also developed alternative techniques for constraint-based association rule mining. Constraints can also be imposed on the support for different itemsets. This problem was investigated by Wang et al. [442], Liu et al. in [387], and Seno et al. [419]. In addition, constraints arising from privacy concerns of mining sensitive data have led to the development of privacy-preserving frequent pattern mining techniques [334, 350, 441, 458].

One potential problem with association analysis is the large number of patterns that can be generated by current algorithms. To overcome this problem, methods to rank, summarize, and filter patterns have been developed. Toivonen et al. [437] proposed the idea of eliminating redundant rules using structural rule covers and grouping the remaining rules using clustering. Liu et al. [388] applied the statistical chi-square test to prune spurious patterns and summarized the remaining patterns using a subset of the patterns called direction setting rules. The use of objective measures to filter patterns has been investigated by many authors, including Brin et al. [336], Bayardo and Agrawal [331], Aggarwal and Yu [323], and DuMouchel and Pregibon [348]. The properties for many of these measures were analyzed by Piatetsky-Shapiro [410], Kamber and Singhal [376], Hilderman and Hamilton [366], and Tan et al. [433]. The grade-gender example used to highlight the importance of the row and column scaling invariance property was heavily influenced by the discussion given in [398] by Mosteller. Meanwhile, the tea-coffee example illustrating the limitation of confidence was motivated by an example given in [336] by Brin et al. Because of the limitation of confidence, Brin et al. [336] had proposed the idea of using interest factor as a measure of interestingness. The all-confidence measure was proposed by Omiecinski [402]. Xiong et al. [449] introduced the cross-support property and showed that the all-confidence measure can be used to eliminate cross-support patterns. A key difficulty in using alternative objective measures besides support is their lack of a monotonicity property, which makes it difficult to incorporate the measures directly into the mining algorithms. Xiong et al. [447] have proposed an efficient method for mining correlations by introducing an upper bound function to the ϕ-coefficient. Although the measure is non-monotone, it has an upper bound expression that can be exploited for the efficient mining of strongly correlated item pairs.

FabrisandFreitas[351]haveproposedamethodfordiscoveringinterestingassociationsbydetectingtheoccurrencesofSimpson'sparadox[423].MegiddoandSrikant[393]describedanapproachforvalidatingtheextractedpatternsusinghypothesistestingmethods.Aresampling-basedtechniquewasalsodevelopedtoavoidgeneratingspuriouspatternsbecauseofthemultiplecomparisonproblem.Boltonetal.[335]haveappliedtheBenjamini-Hochberg[332]andBonferronicorrectionmethodstoadjustthep-valuesofdiscoveredpatternsinmarketbasketdata.AlternativemethodsforhandlingthemultiplecomparisonproblemweresuggestedbyWebb[445],Zhangetal.[460],andLlinares-Lopezetal.[391].

Application of subjective measures to association analysis has been investigated by many authors. Silberschatz and Tuzhilin [421] presented two principles in which a rule can be considered interesting from a subjective point of view. The concept of unexpected condition rules was introduced by Liu et al. in [385]. Cooley et al. [344] analyzed the idea of combining soft belief sets using the Dempster-Shafer theory and applied this approach to identify contradictory and novel association patterns in web data. Alternative approaches include using Bayesian networks [375] and neighborhood-based information [346] to identify subjectively interesting patterns.

Visualizationalsohelpstheusertoquicklygrasptheunderlyingstructureofthediscoveredpatterns.Manycommercialdataminingtoolsdisplaythecompletesetofrules(whichsatisfybothsupportandconfidencethresholdcriteria)asatwo-dimensionalplot,witheachaxiscorrespondingtotheantecedentorconsequentitemsetsoftherule.Hofmannetal.[368]proposedusingMosaicplotsandDoubleDeckerplotstovisualizeassociationrules.Thisapproachcanvisualizenotonlyaparticularrule,butalsotheoverallcontingencytablebetweenitemsetsintheantecedentandconsequentpartsoftherule.Nevertheless,thistechniqueassumesthattheruleconsequentconsistsofonlyasingleattribute.

Application Issues

Association analysis has been applied to a variety of application domains such as web mining [409, 432], document analysis [369], telecommunication alarm diagnosis [377], network intrusion detection [328, 345, 381], and bioinformatics [416, 446]. Applications of association and correlation pattern analysis to Earth Science studies have been investigated in [411, 412, 434]. Trajectory pattern mining [339, 372, 438] is another application of spatio-temporal association analysis to identify frequently traversed paths of moving objects.

Associationpatternshavealsobeenappliedtootherlearningproblemssuchasclassification[383,386],regression[404],andclustering[361,448,452].AcomparisonbetweenclassificationandassociationruleminingwasmadebyFreitasinhispositionpaper[354].Theuseofassociationpatternsfor

clusteringhasbeenstudiedbymanyauthorsincludingHanetal.[361],Kostersetal.[378],Yangetal.[452]andXiongetal.[448].

Bibliography[317]R.C.Agarwal,C.C.Aggarwal,andV.V.V.Prasad.ATreeProjection

AlgorithmforGenerationofFrequentItemsets.JournalofParallelandDistributedComputing(SpecialIssueonHighPerformanceDataMining),61(3):350–371,2001.

[318]R.C.AgarwalandJ.C.Shafer.ParallelMiningofAssociationRules.IEEETransactionsonKnowledgeandDataEngineering,8(6):962–969,March1998.

[319]C.AggarwalandJ.Han.FrequentPatternMining.Springer,2014.

[320]C.C.Aggarwal,Y.Li,J.Wang,andJ.Wang.Frequentpatternminingwithuncertaindata.InProceedingsofthe15thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages29–38,Paris,France,2009.

[321]C.C.Aggarwal,Z.Sun,andP.S.Yu.OnlineGenerationofProfileAssociationRules.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages129—133,NewYork,NY,August1996.

[322]C.C.AggarwalandP.S.Yu.MiningLargeItemsetsforAssociationRules.DataEngineeringBulletin,21(1):23–31,March1998.

[323]C.C.AggarwalandP.S.Yu.MiningAssociationswiththeCollectiveStrengthApproach.IEEETrans.onKnowledgeandDataEngineering,13(6):863–873,January/February2001.

[324]R.Agrawal,T.Imielinski,andA.Swami.Databasemining:Aperformanceperspective.IEEETransactionsonKnowledgeandDataEngineering,5:914–925,1993.

[325]R.Agrawal,T.Imielinski,andA.Swami.Miningassociationrulesbetweensetsofitemsinlargedatabases.InProc.ACMSIGMODIntl.Conf.ManagementofData,pages207–216,Washington,DC,1993.

[326]R.AgrawalandR.Srikant.MiningSequentialPatterns.InProc.ofIntl.Conf.onDataEngineering,pages3–14,Taipei,Taiwan,1995.

[327]K.Ali,S.Manganaris,andR.Srikant.PartialClassificationusingAssociationRules.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryandDataMining,pages115—118,NewportBeach,CA,August1997.

[328]D.Barbará,J.Couto,S.Jajodia,andN.Wu.ADAM:ATestbedforExploringtheUseofDataMininginIntrusionDetection.SIGMODRecord,30(4):15–24,2001.

[329]S.D.BayandM.Pazzani.DetectingGroupDifferences:MiningContrastSets.DataMiningandKnowledgeDiscovery,5(3):213–246,2001.

[330]R.Bayardo.EfficientlyMiningLongPatternsfromDatabases.InProc.of1998ACM-SIGMODIntl.Conf.onManagementofData,pages85–93,Seattle,WA,June1998.

[331]R.BayardoandR.Agrawal.MiningtheMostInterestingRules.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages145–153,SanDiego,CA,August1999.

[332]Y.BenjaminiandY.Hochberg.ControllingtheFalseDiscoveryRate:APracticalandPowerfulApproachtoMultipleTesting.JournalRoyalStatisticalSocietyB,57(1):289–300,1995.

[333]T.Bernecker,H.Kriegel,M.Renz,F.Verhein,andA.Züle.Probabilisticfrequentitemsetmininginuncertaindatabases.InProceedingsofthe15thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages119–128,Paris,France,2009.

[334]R.Bhaskar,S.Laxman,A.D.Smith,andA.Thakurta.Discoveringfrequentpatternsinsensitivedata.InProceedingsofthe16thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages503–512,Washington,DC,2010.

[335]R.J.Bolton,D.J.Hand,andN.M.Adams.DeterminingHitRateinPatternSearch.InProc.oftheESFExploratoryWorkshoponPatternDetectionandDiscoveryinDataMining,pages36–48,London,UK,September2002.

[336]S.Brin,R.Motwani,andC.Silverstein.Beyondmarketbaskets:Generalizingassociationrulestocorrelations.InProc.ACMSIGMODIntl.Conf.ManagementofData,pages265–276,Tucson,AZ,1997.

[337]S.Brin,R.Motwani,J.Ullman,andS.Tsur.DynamicItemsetCountingandImplicationRulesformarketbasketdata.InProc.of1997ACM-SIGMODIntl.Conf.onManagementofData,pages255–264,Tucson,AZ,June1997.

[338]C.H.Cai,A.Fu,C.H.Cheng,andW.W.Kwong.MiningAssociationRuleswithWeightedItems.InProc.ofIEEEIntl.DatabaseEngineeringandApplicationsSymp.,pages68–77,Cardiff,Wales,1998.

[339]H.Cao,N.Mamoulis,andD.W.Cheung.MiningFrequentSpatio-TemporalSequentialPatterns.InProceedingsofthe5thIEEEInternationalConferenceonDataMining,pages82–89,Houston,TX,2005.

[340]R.Chan,Q.Yang,andY.Shen.MiningHighUtilityItemsets.InProceedingsofthe3rdIEEEInternationalConferenceonDataMining,pages19–26,Melbourne,FL,2003.

[341]Q.Chen,U.Dayal,andM.Hsu.ADistributedOLAPinfrastructureforE-Commerce.InProc.ofthe4thIFCISIntl.Conf.onCooperativeInformationSystems,pages209—220,Edinburgh,Scotland,1999.

[342]D.C.Cheung,S.D.Lee,andB.Kao.AGeneralIncrementalTechniqueforMaintainingDiscoveredAssociationRules.InProc.ofthe5thIntl.Conf.

onDatabaseSystemsforAdvancedApplications,pages185–194,Melbourne,Australia,1997.

[343]C.K.Chui,B.Kao,andE.Hung.MiningFrequentItemsetsfromUncertainData.InProceedingsofthe11thPacific-AsiaConferenceonKnowledgeDiscoveryandDataMining,pages47–58,Nanjing,China,2007.

[344]R.Cooley,P.N.Tan,andJ.Srivastava.DiscoveryofInterestingUsagePatternsfromWebData.InM.SpiliopoulouandB.Masand,editors,AdvancesinWebUsageAnalysisandUserProfiling,volume1836,pages163–182.LectureNotesinComputerScience,2000.

[345]P.Dokas,L.Ertöz,V.Kumar,A.Lazarevic,J.Srivastava,andP.N.Tan.DataMiningforNetworkIntrusionDetection.InProc.NSFWorkshoponNextGenerationDataMining,Baltimore,MD,2002.

[346]G.DongandJ.Li.Interestingnessofdiscoveredassociationrulesintermsofneighborhood-basedunexpectedness.InProc.ofthe2ndPacific-AsiaConf.onKnowledgeDiscoveryandDataMining,pages72–86,Melbourne,Australia,April1998.

[347]G.DongandJ.Li.EfficientMiningofEmergingPatterns:DiscoveringTrendsandDifferences.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages43–52,SanDiego,CA,August1999.

[348]W.DuMouchelandD.Pregibon.EmpiricalBayesScreeningforMulti-ItemAssociations.InProc.ofthe7thIntl.Conf.onKnowledgeDiscovery

andDataMining,pages67–76,SanFrancisco,CA,August2001.

[349]B.DunkelandN.Soparkar.DataOrganizationandAccessforEfficientDataMining.InProc.ofthe15thIntl.Conf.onDataEngineering,pages522–529,Sydney,Australia,March1999.

[350]A.V.Evfimievski,R.Srikant,R.Agrawal,andJ.Gehrke.Privacypreservingminingofassociationrules.InProceedingsoftheEighthACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining,pages217–228,Edmonton,Canada,2002.

[351]C.C.FabrisandA.A.Freitas.DiscoveringsurprisingpatternsbydetectingoccurrencesofSimpson'sparadox.InProc.ofthe19thSGESIntl.Conf.onKnowledge-BasedSystemsandAppliedArtificialIntelligence),pages148–160,Cambridge,UK,December1999.

[352]G.Fang,G.Pandey,W.Wang,M.Gupta,M.Steinbach,andV.Kumar.MiningLow-SupportDiscriminativePatternsfromDenseandHigh-DimensionalData.IEEETrans.Knowl.DataEng.,24(2):279–294,2012.

[353]L.Feng,H.J.Lu,J.X.Yu,andJ.Han.Mininginter-transactionassociationswithtemplates.InProc.ofthe8thIntl.Conf.onInformationandKnowledgeManagement,pages225–233,KansasCity,Missouri,Nov1999.

5.10 Exercises

1. For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also, describe whether such rules are subjectively interesting.

a. A rule that has high support and high confidence.

b. A rule that has reasonably high support but low confidence.

c. A rule that has low support and low confidence.

d. A rule that has low support and high confidence.

2. Consider the data set shown in Table 5.20.

Table 5.20. Example of market basket transactions.

Customer ID   Transaction ID   Items Bought

1   0001   {a,d,e}
1   0024   {a,b,c,e}
2   0012   {a,b,d,e}
2   0031   {a,c,d,e}
3   0015   {b,c,e}
3   0022   {b,d,e}
4   0029   {c,d}
4   0040   {a,b,c}
5   0033   {a,d,e}
5   0038   {a,b,e}

a. Compute the support for itemsets {e}, {b,d}, and {b,d,e} by treating each transaction ID as a market basket.

b. Use the results in part (a) to compute the confidence for the association rules {b,d} → {e} and {e} → {b,d}. Is confidence a symmetric measure?

c. Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if an item appears in at least one transaction bought by the customer, and 0 otherwise).

d. Use the results in part (c) to compute the confidence for the association rules {b,d} → {e} and {e} → {b,d}.

e. Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also, let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2 or c1 and c2.

3.

a. What is the confidence for the rules ∅ → A and A → ∅?

b. Let c1, c2, and c3 be the confidence values of the rules {p} → {q}, {p} → {q,r}, and {p,r} → {q}, respectively. If we assume that c1, c2, and c3 have different values, what are the possible relationships that may exist among c1, c2, and c3? Which rule has the lowest confidence?

c. Repeat the analysis in part (b) assuming that the rules have identical support. Which rule has the highest confidence?

d. Transitivity: Suppose the confidence of the rules A → B and B → C are larger than some threshold, minconf. Is it possible that A → C has a confidence less than minconf?

4. For each of the following measures, determine whether it is monotone, anti-monotone, or non-monotone (i.e., neither monotone nor anti-monotone).

Example: Support, s(X) = σ(X)/|T|, is anti-monotone because s(X) ≥ s(Y) whenever X ⊂ Y.

a. A characteristic rule is a rule of the form {p} → {q1, q2, …, qn}, where the rule antecedent contains only a single item. An itemset of size k can produce up to k characteristic rules. Let ζ be the minimum confidence of all characteristic rules generated from a given itemset:

ζ({p1, p2, …, pk}) = min [ c({p1} → {p2, p3, …, pk}), …, c({pk} → {p1, p2, …, pk−1}) ].

Is ζ monotone, anti-monotone, or non-monotone?

b. A discriminant rule is a rule of the form {p1, p2, …, pn} → {q}, where the rule consequent contains only a single item. An itemset of size k can produce up to k discriminant rules. Let η be the minimum confidence of all discriminant rules generated from a given itemset:

η({p1, p2, …, pk}) = min [ c({p2, p3, …, pk} → {p1}), …, c({p1, p2, …, pk−1} → {pk}) ].

Is η monotone, anti-monotone, or non-monotone?

c. Repeat the analysis in parts (a) and (b) by replacing the min function with a max function.

5. Prove Equation 5.3. (Hint: First, count the number of ways to create an itemset that forms the left-hand side of the rule. Next, for each size-k itemset selected for the left-hand side, count the number of ways to choose the remaining d − k items to form the right-hand side of the rule.) Assume that neither of the itemsets of a rule are empty.

6. Consider the market basket transactions shown in Table 5.21.

a. What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)?

b. What is the maximum size of frequent itemsets that can be extracted (assuming minsup > 0)?

Table 5.21. Market basket transactions.

Transaction ID   Items Bought

1    {Milk, Beer, Diapers}
2    {Bread, Butter, Milk}
3    {Milk, Diapers, Cookies}
4    {Bread, Butter, Cookies}
5    {Beer, Cookies, Diapers}
6    {Milk, Diapers, Bread, Butter}
7    {Bread, Butter, Diapers}
8    {Beer, Diapers}
9    {Milk, Diapers, Bread, Butter}
10   {Beer, Cookies}

c. Write an expression for the maximum number of size-3 itemsets that can be derived from this data set.

d. Find an itemset (of size 2 or larger) that has the largest support.

e. Find a pair of items, a and b, such that the rules {a} → {b} and {b} → {a} have the same confidence.

7. Show that if a candidate k-itemset X has a subset of size less than k − 1 that is infrequent, then at least one of the (k − 1)-size subsets of X is necessarily infrequent.

8. Consider the following set of frequent 3-itemsets:

{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {2,3,4}, {2,3,5}, {3,4,5}.

Assume that there are only five items in the data set.

a. List all candidate 4-itemsets obtained by a candidate generation procedure using the Fk−1 × F1 merging strategy.

b. List all candidate 4-itemsets obtained by the candidate generation procedure in Apriori.

c. List all candidate 4-itemsets that survive the candidate pruning step of the Apriori algorithm.

9. The Apriori algorithm uses a generate-and-count strategy for deriving frequent itemsets. Candidate itemsets of size k + 1 are created by joining a pair of frequent itemsets of size k (this is known as the candidate generation step). A candidate is discarded if any one of its subsets is found to be infrequent during the candidate pruning step. Suppose the Apriori algorithm is applied to the data set shown in Table 5.22 with minsup = 30%, i.e., any itemset occurring in less than 3 transactions is considered to be infrequent.

Table 5.22. Example of market basket transactions.

Transaction ID   Items Bought

1    {a,b,d,e}
2    {b,c,d}
3    {a,b,d,e}
4    {a,c,d,e}
5    {b,c,d,e}
6    {b,d,e}
7    {c,d}
8    {a,b,c}
9    {a,d,e}
10   {b,d}

a. Draw an itemset lattice representing the data set given in Table 5.22. Label each node in the lattice with the following letter(s):

N: If the itemset is not considered to be a candidate itemset by the Apriori algorithm. There are two reasons for an itemset not to be considered as a candidate itemset: (1) it is not generated at all during the candidate generation step, or (2) it is generated during the candidate generation step but is subsequently removed during the candidate pruning step because one of its subsets is found to be infrequent.

F: If the candidate itemset is found to be frequent by the Apriori algorithm.

I: If the candidate itemset is found to be infrequent after support counting.

b. What is the percentage of frequent itemsets (with respect to all itemsets in the lattice)?

c. What is the pruning ratio of the Apriori algorithm on this data set? (Pruning ratio is defined as the percentage of itemsets not considered to be a candidate because (1) they are not generated during candidate generation or (2) they are pruned during the candidate pruning step.)

d. What is the false alarm rate (i.e., the percentage of candidate itemsets that are found to be infrequent after performing support counting)?

10. The Apriori algorithm uses a hash tree data structure to efficiently count the support of candidate itemsets. Consider the hash tree for candidate 3-itemsets shown in Figure 5.32.

Figure 5.32. An example of a hash tree structure.

a. Given a transaction that contains items {1,3,4,5,8}, which of the hash tree leaf nodes will be visited when finding the candidates of the transaction?

b. Use the visited leaf nodes in part (a) to determine the candidate itemsets that are contained in the transaction {1,3,4,5,8}.

11. Consider the following set of candidate 3-itemsets:

{1,2,3}, {1,2,6}, {1,3,4}, {2,3,4}, {2,4,5}, {3,4,6}, {4,5,6}

a. Construct a hash tree for the above candidate 3-itemsets. Assume the tree uses a hash function where all odd-numbered items are hashed to the left child of a node, while the even-numbered items are hashed to the right child. A candidate k-itemset is inserted into the tree by hashing on each successive item in the candidate and then following the appropriate branch of the tree according to the hash value. Once a leaf node is reached, the candidate is inserted based on one of the following conditions:

Condition 1: If the depth of the leaf node is equal to k (the root is assumed to be at depth 0), then the candidate is inserted regardless of the number of itemsets already stored at the node.

Condition 2: If the depth of the leaf node is less than k, then the candidate can be inserted as long as the number of itemsets stored at the node is less than maxsize. Assume maxsize = 2 for this question.

Condition 3: If the depth of the leaf node is less than k and the number of itemsets stored at the node is equal to maxsize, then the leaf node is converted into an internal node. New leaf nodes are created as children of the old leaf node. Candidate itemsets previously stored in the old leaf node are distributed to the children based on their hash values. The new candidate is also hashed to its appropriate leaf node.

b. How many leaf nodes are there in the candidate hash tree? How many internal nodes are there?

c. Consider a transaction that contains the following items: {1,2,3,5,6}. Using the hash tree constructed in part (a), which leaf nodes will be checked against the transaction? What are the candidate 3-itemsets contained in the transaction?

12. Given the lattice structure shown in Figure 5.33 and the transactions given in Table 5.22, label each node with the following letter(s):

Figure 5.33. An itemset lattice.

M if the node is a maximal frequent itemset,

C if it is a closed frequent itemset,

N if it is frequent but neither maximal nor closed, and

I if it is infrequent.

Assume that the support threshold is equal to 30%.

13. The original association rule mining formulation uses the support and confidence measures to prune uninteresting rules.

a. Draw a contingency table for each of the following rules using the transactions shown in Table 5.23.

Table 5.23. Example of market basket transactions.

Transaction ID   Items Bought

1    {a,b,d,e}
2    {b,c,d}
3    {a,b,d,e}
4    {a,c,d,e}
5    {b,c,d,e}
6    {b,d,e}
7    {c,d}
8    {a,b,c}
9    {a,d,e}
10   {b,d}

Rules: {b} → {c}, {a} → {d}, {b} → {d}, {e} → {c}, {c} → {a}.

b. Use the contingency tables in part (a) to compute and rank the rules in decreasing order according to the following measures.

i. Support.

ii. Confidence.

iii. Interest(X → Y) = P(X,Y) / (P(X)P(Y)).

iv. IS(X → Y) = P(X,Y) / √(P(X)P(Y)).

v. Klosgen(X → Y) = √P(X,Y) × max(P(Y|X) − P(Y), P(X|Y) − P(X)), where P(Y|X) = P(X,Y)/P(X).

vi. Odds ratio(X → Y) = (P(X,Y) P(X̄,Ȳ)) / (P(X,Ȳ) P(X̄,Y)).

14. Given the rankings you had obtained in Exercise 13, compute the correlation between the rankings of confidence and the other five measures. Which measure is most highly correlated with confidence? Which measure is least correlated with confidence?

15. Answer the following questions using the data sets shown in Figure 5.34. Note that each data set contains 1000 items and 10,000 transactions. Dark cells indicate the presence of items and white cells indicate the absence of items. We will apply the Apriori algorithm to extract frequent itemsets with minsup = 10% (i.e., itemsets must be contained in at least 1000 transactions).

Figure 5.34. Figures for Exercise 15.

a. Which data set(s) will produce the most number of frequent itemsets?

b. Which data set(s) will produce the fewest number of frequent itemsets?

c. Which data set(s) will produce the longest frequent itemset?

d. Which data set(s) will produce frequent itemsets with highest maximum support?

e. Which data set(s) will produce frequent itemsets containing items with wide-varying support levels (i.e., items with mixed support, ranging from less than 20% to more than 70%)?

16.

a. Prove that the ϕ coefficient is equal to 1 if and only if f11 = f1+ = f+1.

b. Show that if A and B are independent, then P(A,B) × P(Ā,B̄) = P(A,B̄) × P(Ā,B).

c. Show that Yule's Q and Y coefficients,

Q = (f11 f00 − f10 f01) / (f11 f00 + f10 f01),
Y = (√(f11 f00) − √(f10 f01)) / (√(f11 f00) + √(f10 f01)),

are normalized versions of the odds ratio.

d. Write a simplified expression for the value of each measure shown in Table 5.9 when the variables are statistically independent.

17. Consider the interestingness measure, M = (P(B|A) − P(B)) / (1 − P(B)), for an association rule A → B.

a. What is the range of this measure? When does the measure attain its maximum and minimum values?

b. How does M behave when P(A,B) is increased while P(A) and P(B) remain unchanged?

c. How does M behave when P(A) is increased while P(A,B) and P(B) remain unchanged?

d. How does M behave when P(B) is increased while P(A,B) and P(A) remain unchanged?

e. Is the measure symmetric under variable permutation?

f. What is the value of the measure when A and B are statistically independent?

g. Is the measure null-invariant?

h. Does the measure remain invariant under row or column scaling operations?

i. How does the measure behave under the inversion operation?

18. Suppose we have market basket data consisting of 100 transactions and 20 items. Assume the support for item a is 25%, the support for item b is 90%, and the support for itemset {a,b} is 20%. Let the support and confidence thresholds be 10% and 60%, respectively.

a. Compute the confidence of the association rule {a} → {b}. Is the rule interesting according to the confidence measure?

b. Compute the interest measure for the association pattern {a,b}. Describe the nature of the relationship between item a and item b in terms of the interest measure.

c. What conclusions can you draw from the results of parts (a) and (b)?

d. Prove that if the confidence of the rule {a} → {b} is less than the support of {b}, then:

i. c({ā} → {b}) > c({a} → {b}),

ii. c({ā} → {b}) > s({b}),

where c(⋅) denotes the rule confidence and s(⋅) denotes the support of an itemset.

19. Table 5.24 shows a 2 × 2 × 2 contingency table for the binary variables A and B at different values of the control variable C.

Table 5.24. A contingency table.

                  A = 1   A = 0
C = 0   B = 1       0       15
        B = 0      15       30
C = 1   B = 1       5        0
        B = 0       0       15

a. Compute the ϕ coefficient for A and B when C = 0, C = 1, and C = 0 or 1. Note that

ϕ({A,B}) = (P(A,B) − P(A)P(B)) / √(P(A)P(B)(1 − P(A))(1 − P(B))).

b. What conclusions can you draw from the above result?

20. Consider the contingency tables shown in Table 5.25.

a. For Table I, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules A → B and B → A.

b. For Table II, compute support, the interest measure, and the ϕ correlation coefficient for the association pattern {A, B}. Also, compute the confidence of rules A → B and B → A.

Table 5.25. Contingency tables for Exercise 20.

(a) Table I.

        B     B̄
A       9     1
Ā       1    89

(b) Table II.

        B     B̄
A      89     1
Ā       1     9

c. What conclusions can you draw from the results of (a) and (b)?

21. Consider the relationship between customers who buy high-definition televisions and exercise machines as shown in Tables 5.17 and 5.18.

a. Compute the odds ratios for both tables.

b. Compute the ϕ-coefficient for both tables.

c. Compute the interest factor for both tables.

For each of the measures given above, describe how the direction of association changes when data is pooled together instead of being stratified.

6 Association Analysis: Advanced Concepts

The association rule mining formulation described in the previous chapter assumes that the input data consists of binary attributes called items. The presence of an item in a transaction is also assumed to be more important than its absence. As a result, an item is treated as an asymmetric binary attribute and only frequent patterns are considered interesting.

This chapter extends the formulation to data sets with symmetric binary, categorical, and continuous attributes. The formulation will also be extended to incorporate more complex entities such as sequences and graphs. Although the overall structure of association analysis algorithms remains unchanged, certain aspects of the algorithms must be modified to handle the non-traditional entities.

6.1 Handling Categorical Attributes

There are many applications that contain symmetric binary and nominal attributes. The Internet survey data shown in Table 6.1 contains symmetric binary attributes such as Gender, Computer at Home, Chat Online, Shop Online, and Privacy Concerns, as well as nominal attributes such as Level of Education and State. Using association analysis, we may uncover interesting information about the characteristics of Internet users such as

{Shop Online = Yes} → {Privacy Concerns = Yes}.

Table 6.1. Internet survey data with categorical attributes.

Gender   Level of Education   State        Computer at Home   Chat Online   Shop Online   Privacy Concerns

Female   Graduate             Illinois     Yes                Yes           Yes           Yes
Male     College              California   No                 No            No            No
Male     Graduate             Michigan     Yes                Yes           Yes           Yes
Female   College              Virginia     No                 No            Yes           Yes
Female   Graduate             California   Yes                No            No            Yes
Male     College              Minnesota    Yes                Yes           Yes           Yes
Male     College              Alaska       Yes                Yes           Yes           No
Male     High School          Oregon       Yes                No            No            No
Female   Graduate             Texas        No                 Yes           No            No
…        …                    …            …                  …             …             …

This rule suggests that most Internet users who shop online are concerned about their personal privacy.

To extract such patterns, the categorical and symmetric binary attributes are transformed into “items” first, so that existing association rule mining algorithms can be applied. This type of transformation can be performed by creating a new item for each distinct attribute-value pair. For example, the nominal attribute Level of Education can be replaced by three binary items: Education = College, Education = Graduate, and Education = High School. Similarly, symmetric binary attributes such as Gender can be converted into a pair of binary items, Male and Female. Table 6.2 shows the result of binarizing the Internet survey data.

Table 6.2. Internet survey data after binarizing categorical and symmetric binary attributes.

Male   Female   Education=Graduate   Education=College   …   Privacy=Yes   Privacy=No

0      1        1                    0                   …   1             0
1      0        0                    1                   …   0             1
1      0        1                    0                   …   1             0
0      1        0                    1                   …   1             0
0      1        1                    0                   …   1             0
1      0        0                    1                   …   1             0
1      0        0                    1                   …   0             1
1      0        0                    0                   …   0             1
0      1        1                    0                   …   0             1
…      …        …                    …                   …   …             …
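To make the transformation concrete, here is a minimal sketch (not taken from the text) of attribute-value binarization. The record keys and values below are hypothetical stand-ins for the survey attributes; the helper simply emits one "Attribute=value" item per distinct pair, which is the form a standard frequent itemset algorithm expects.

```python
def binarize_records(records):
    """Turn each record (dict of attribute -> value) into a set of
    'Attribute=value' items, one binary item per distinct pair."""
    transactions = []
    items = set()
    for rec in records:
        t = {f"{attr}={val}" for attr, val in rec.items()}
        transactions.append(t)
        items |= t
    return transactions, sorted(items)

# Hypothetical survey rows mirroring the style of Table 6.2
records = [
    {"Gender": "Female", "Education": "Graduate", "Privacy": "Yes"},
    {"Gender": "Male",   "Education": "College",  "Privacy": "No"},
]
transactions, items = binarize_records(records)
print(items)            # ['Education=College', 'Education=Graduate', 'Gender=Female', ...]
print(transactions[0])  # {'Gender=Female', 'Education=Graduate', 'Privacy=Yes'}
```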

There are several issues to consider when applying association analysis to the binarized data:

1. Some attribute values may not be frequent enough to be part of a frequent pattern. This problem is more evident for nominal attributes that have many possible values, e.g., state names. Lowering the support threshold does not help because it exponentially increases the number of frequent patterns found (many of which may be spurious) and makes the computation more expensive. A more practical solution is to group related attribute values into a small number of categories. For example, each state name can be replaced by its corresponding geographical region. Another possibility is to aggregate the less frequent attribute values into a single category called Others, as shown in Figure 6.1.

Figure 6.1. A pie chart with a merged category called Others.

2. Some attribute values may have considerably higher frequencies than others. For example, suppose 85% of the survey participants own a home computer. By creating a binary item for each attribute value that appears frequently in the data, we may potentially generate many redundant patterns, as illustrated by the following example:

{Computer at Home = Yes, Shop Online = Yes} → {Privacy Concerns = Yes}.

The rule is redundant because it is subsumed by the more general rule given at the beginning of this section. Because the high-frequency items correspond to the typical values of an attribute, they seldom carry any new information that can help us to better understand the pattern. It may therefore be useful to remove such items before applying standard association analysis algorithms. Another possibility is to apply the techniques presented in Section 5.8 for handling data sets with a wide range of support values.

3. Although the width of every transaction is the same as the number of attributes in the original data, the computation time may increase especially when many of the newly created items become frequent. This is because more time is needed to deal with the additional candidate itemsets generated by these items (see Exercise 1 on page 510). One way to reduce the computation time is to avoid generating candidate itemsets that contain more than one item from the same attribute. For example, we do not have to generate a candidate itemset such as {Education = College, Education = Graduate} because the support count of the itemset is zero.

6.2 Handling Continuous Attributes

The Internet survey data described in the previous section may also contain continuous attributes such as the ones shown in Table 6.3. Mining the continuous attributes may reveal useful insights about the data such as “users whose annual income is more than $120K belong to the 45–60 age group” or “users who have more than 3 email accounts and spend more than 15 hours online per week are often concerned about their personal privacy.” Association rules that contain continuous attributes are commonly known as quantitative association rules.

Table 6.3. Internet survey data with continuous attributes.

Gender   …   Age   Annual Income   No. of Hours Spent Online per Week   No. of Email Accounts   Privacy Concern

Female   …   26    90K             20                                   4                       Yes
Male     …   51    135K            10                                   2                       No
Male     …   29    80K             10                                   3                       Yes
Female   …   45    120K            15                                   3                       Yes
Female   …   31    95K             20                                   5                       Yes
Male     …   25    55K             25                                   5                       Yes
Male     …   37    100K            10                                   1                       No
Male     …   41    65K             8                                    2                       No
Female   …   26    85K             12                                   1                       No
…        …   …     …               …                                    …                       …

This section describes the various methodologies for applying association analysis to continuous data. We will specifically discuss three types of methods: (1) discretization-based methods, (2) statistics-based methods, and (3) non-discretization methods. The quantitative association rules derived using these methods are quite different in nature.

6.2.1 Discretization-Based Methods

Discretization is the most common approach for handling continuous attributes. This approach groups the adjacent values of a continuous attribute into a finite number of intervals. For example, the Age attribute can be divided into the following intervals: Age ∈ [12,16), Age ∈ [16,20), Age ∈ [20,24), …, Age ∈ [56,60), where [a,b) represents an interval that includes a but not b. Discretization can be performed using any of the techniques described in Section 2.3.6 (equal interval width, equal frequency, entropy-based, or clustering). The discrete intervals are then mapped into asymmetric binary attributes so that existing association analysis algorithms can be applied. Table 6.4 shows the Internet survey data after discretization and binarization.

Table 6.4. Internet survey data after binarizing categorical and continuous attributes.

Male   Female   …   Age < 13   Age ∈ [13,21)   Age ∈ [21,30)   …   Privacy=Yes   Privacy=No

0      1        …   0          0               1               …   1             0
1      0        …   0          0               0               …   0             1
1      0        …   0          0               1               …   1             0
0      1        …   0          0               0               …   1             0
0      1        …   0          0               0               …   1             0
1      0        …   0          0               1               …   1             0
1      0        …   0          0               0               …   0             1
1      0        …   0          0               0               …   0             1
0      1        …   0          0               1               …   0             1
…      …        …   …          …               …               …   …             …
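As a rough sketch of the discretize-then-binarize step, the code below assumes an equal-width scheme over a hypothetical Age range of [12, 60) with 4-year intervals; the function name and interval bounds are illustrative, not part of the text.

```python
def equal_width_item(value, lo=12, hi=60, width=4, name="Age"):
    """Map a continuous value to the binary item for its equal-width interval."""
    if not (lo <= value < hi):
        raise ValueError(f"{name} value {value} outside [{lo}, {hi})")
    k = int((value - lo) // width)   # index of the interval containing the value
    a = lo + k * width
    return f"{name}∈[{a},{a + width})"

ages = [26, 51, 29, 45, 31]
print([equal_width_item(a) for a in ages])
# ['Age∈[24,28)', 'Age∈[48,52)', 'Age∈[28,32)', 'Age∈[44,48)', 'Age∈[28,32)']
```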

A key parameter in attribute discretization is the number of intervals used to partition each attribute. This parameter is typically provided by the users and can be expressed in terms of the interval width (for the equal interval width approach), the average number of transactions per interval (for the equal frequency approach), or the number of desired clusters (for the clustering-based approach). The difficulty in determining the right number of intervals can be illustrated using the data set shown in Table 6.5, which summarizes the responses of 250 users who participated in the survey. There are two strong rules embedded in the data:

R1: Age ∈ [16,24) → Chat Online = Yes (s = 8.8%, c = 81.5%).
R2: Age ∈ [44,60) → Chat Online = No (s = 16.8%, c = 70.0%).

Table 6.5. A breakdown of Internet users who participated in online chat according to their age group.

Age Group   Chat Online = Yes   Chat Online = No

[12,16)     12                  13
[16,20)     11                   2
[20,24)     11                   3
[24,28)     12                  13
[28,32)     14                  12
[32,36)     15                  12
[36,40)     16                  14
[40,44)     16                  14
[44,48)      4                  10
[48,52)      5                  11
[52,56)      5                  10
[56,60)      4                  11

These rules suggest that most of the users from the age group of 16–24 often participate in online chatting, while those from the age group of 44–60 are less likely to chat online. In this example, we consider a rule to be interesting only if its support (s) exceeds 5% and its confidence (c) exceeds 65%. One of the problems encountered when discretizing the Age attribute is how to determine the interval width.

1. If the interval is too wide, then we may lose some patterns because of their lack of confidence. For example, when the interval width is 24 years, R1 and R2 are replaced by the following rules:

R′1: Age ∈ [12,36) → Chat Online = Yes (s = 30%, c = 57.7%).
R′2: Age ∈ [36,60) → Chat Online = No (s = 28%, c = 58.3%).

Despite their higher supports, the wider intervals have caused the confidence for both rules to drop below the minimum confidence threshold. As a result, both patterns are lost after discretization.

2. If the interval is too narrow, then we may lose some patterns because of their lack of support. For example, if the interval width is 4 years, then R1 is broken up into the following two subrules:

R11(4): Age ∈ [16,20) → Chat Online = Yes (s = 4.4%, c = 84.6%).
R12(4): Age ∈ [20,24) → Chat Online = Yes (s = 4.4%, c = 78.6%).

Since the supports for the subrules are less than the minimum support threshold, R1 is lost after discretization. Similarly, the rule R2, which is broken up into four subrules, will also be lost because the support of each subrule is less than the minimum support threshold.

3. If the interval width is 8 years, then the rule R2 is broken up into the following two subrules:

R21(8): Age ∈ [44,52) → Chat Online = No (s = 8.4%, c = 70%).
R22(8): Age ∈ [52,60) → Chat Online = No (s = 8.4%, c = 70%).

Since R21(8) and R22(8) have sufficient support and confidence, R2 can be recovered by aggregating both subrules. Meanwhile, R1 is broken up into the following two subrules:

R11(8): Age ∈ [12,20) → Chat Online = Yes (s = 9.2%, c = 60.5%).
R12(8): Age ∈ [20,28) → Chat Online = Yes (s = 9.2%, c = 59.0%).

Unlike R2, we cannot recover the rule R1 by aggregating the subrules because both subrules fail the confidence threshold.

One way to address these issues is to consider every possible grouping of adjacent intervals. For example, we can start with an interval width of 4 years and then merge the adjacent intervals into wider intervals: Age ∈ [12,16), Age ∈ [12,20), …, Age ∈ [12,60), Age ∈ [16,20), Age ∈ [16,24), etc.

This approach enables the detection of both R1 and R2 as strong rules. However, it also leads to the following computational issues:

1. The computation becomes extremely expensive. If the range is initially divided into k intervals, then k(k−1)/2 binary items must be generated to represent all possible intervals. Furthermore, if an item corresponding to the interval [a,b) is frequent, then all other items corresponding to intervals that subsume [a,b) must be frequent too. This approach can therefore generate far too many candidate and frequent itemsets. To address these problems, a maximum support threshold can be applied to prevent the creation of items corresponding to very wide intervals and to reduce the number of itemsets.

2. Many redundant rules are extracted. For example, consider the following pair of rules:

R3: {Age ∈ [16,20), Gender = Male} → {Chat Online = Yes},
R4: {Age ∈ [16,24), Gender = Male} → {Chat Online = Yes}.

R4 is a generalization of R3 (and R3 is a specialization of R4) because R4 has a wider interval for the Age attribute. If the confidence values for both rules are the same, then R4 should be more interesting because it covers more examples, including those for R3. R3 is therefore a redundant rule.

6.2.2 Statistics-Based Methods

Quantitative association rules can be used to infer the statistical properties of a population. For example, suppose we are interested in finding the average age of certain groups of Internet users based on the data provided in Tables 6.1 and 6.3.

Using the statistics-based method described in this section, quantitative association rules such as the following can be extracted:

{Annual Income > $100K, Shop Online = Yes} → Age: Mean = 38.

The rule states that the average age of Internet users whose annual income exceeds $100K and who shop online regularly is 38 years old.

Rule Generation

To generate the statistics-based quantitative association rules, the target attribute used to characterize interesting segments of the population must be specified. By withholding the target attribute, the remaining categorical and continuous attributes in the data are binarized using the methods described in the previous section. Existing algorithms such as Apriori or FP-growth are then applied to extract frequent itemsets from the binarized data. Each frequent itemset identifies an interesting segment of the population. The distribution of the target attribute in each segment can be summarized using descriptive statistics such as mean, median, variance, or absolute deviation. For example, the preceding rule is obtained by averaging the age of Internet users who support the frequent itemset {Annual Income > $100K, Shop Online = Yes}.

The number of quantitative association rules discovered using this method is the same as the number of extracted frequent itemsets. Because of the way the quantitative association rules are defined, the notion of confidence is not applicable to such rules. An alternative method for validating the quantitative association rules is presented next.

Rule Validation

A quantitative association rule is interesting only if the statistics computed from transactions covered by the rule are different than those computed from transactions not covered by the rule. For example, the rule given at the beginning of this section is interesting only if the average age of Internet users who do not support the frequent itemset {Annual Income > $100K, Shop Online = Yes} is significantly higher or lower than 38 years old. To determine whether the difference in their average ages is statistically significant, statistical hypothesis testing methods should be applied.

Consider the quantitative association rule, A → t: μ, where A is a frequent itemset, t is the continuous target attribute, and μ is the average value of t among transactions covered by A. Furthermore, let μ′ denote the average value of t among transactions not covered by A. The goal is to test whether the difference between μ and μ′ is greater than some user-specified threshold, Δ. In statistical hypothesis testing, two opposite propositions, known as the null hypothesis and the alternative hypothesis, are given. A hypothesis test is performed to determine which of these two hypotheses should be accepted, based on evidence gathered from the data (see Appendix C).

In this case, assuming that μ < μ′, the null hypothesis is H0: μ′ = μ + Δ, while the alternative hypothesis is H1: μ′ > μ + Δ. To determine which hypothesis should be accepted, the following Z-statistic is computed:

Z = (μ′ − μ − Δ) / √(s1²/n1 + s2²/n2),   (6.1)

where n1 is the number of transactions supporting A, n2 is the number of transactions not supporting A, s1 is the standard deviation for t among transactions that support A, and s2 is the standard deviation for t among transactions that do not support A. Under the null hypothesis, Z has a standard normal distribution with mean 0 and variance 1. The value of Z computed using Equation 6.1 is then compared against a critical value, Zα,

which is a threshold that depends on the desired confidence level. If Z > Zα, then the null hypothesis is rejected and we may conclude that the quantitative association rule is interesting. Otherwise, there is not enough evidence in the data to show that the difference in mean is statistically significant.

Example 6.1. Consider the quantitative association rule

{Income > 100K, Shop Online = Yes} → Age: μ = 38.

Suppose there are 50 Internet users who supported the rule antecedent. The standard deviation of their ages is 3.5. On the other hand, the average age of the 200 users who do not support the rule antecedent is 30 and their standard deviation is 6.5. Assume that a quantitative association rule is considered interesting only if the difference between μ and μ′ is more than 5 years. Using Equation 6.1 we obtain

Z = (38 − 30 − 5) / √(3.5²/50 + 6.5²/200) = 4.4414.

For a one-sided hypothesis test at a 95% confidence level, the critical value for rejecting the null hypothesis is 1.64. Since Z > 1.64, the null hypothesis can be rejected. We therefore conclude that the quantitative association rule is interesting because the difference between the average ages of users who support and do not support the rule antecedent is more than 5 years.
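The numbers in Example 6.1 can be checked directly against Equation 6.1. The small helper below is only a numerical verification sketch; taking the absolute difference of the two means is a simplification of the one-sided test described above.

```python
from math import sqrt

def z_statistic(mu1, s1, n1, mu2, s2, n2, delta):
    """Standardized difference in means (Equation 6.1): compares the target
    attribute's mean over transactions covered vs. not covered by the rule.
    Uses |mu1 - mu2| as a shortcut for the one-sided setup in the text."""
    return (abs(mu1 - mu2) - delta) / sqrt(s1**2 / n1 + s2**2 / n2)

# Numbers from Example 6.1
z = z_statistic(mu1=38, s1=3.5, n1=50,    # users covered by the rule antecedent
                mu2=30, s2=6.5, n2=200,   # users not covered
                delta=5)
print(round(z, 4))   # 4.4414 -> exceeds the one-sided critical value 1.64
```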

6.2.3 Non-discretization Methods


There are certain applications in which analysts are more interested in finding associations among the continuous attributes, rather than associations among discrete intervals of the continuous attributes. For example, consider the problem of finding word associations in text documents. Table 6.6 shows a document-word matrix where every entry represents the number of times a word appears in a given document. Given such a data matrix, we are interested in finding associations between words (e.g., word1 and word2) instead of associations between ranges of word counts (e.g., word1 ∈ [1,4] and word2 ∈ [2,3]). One way to do this is to transform the data into a 0/1 matrix, where the entry is 1 if the count exceeds some threshold t, and 0 otherwise. While this approach allows analysts to apply existing frequent itemset generation algorithms to the binarized data set, finding the right threshold for binarization can be quite tricky. If the threshold is set too high, it is possible to miss some interesting associations. Conversely, if the threshold is set too low, there is a potential for generating a large number of spurious associations.

Table 6.6. Document-word matrix.

Document   word1   word2   word3   word4   word5   word6

d1         3       6       0       0       0       2
d2         1       2       0       0       0       2
d3         4       2       7       0       0       2
d4         2       0       3       0       0       1
d5         0       0       0       1       1       0

This section presents another methodology for finding associations among continuous attributes, known as the min-Apriori approach.

Analogous to traditional association analysis, an itemset is considered to be a collection of continuous attributes, while its support measures the degree of association among the attributes, across multiple rows of the data matrix. For example, a collection of words in Table 6.6 can be referred to as an itemset, whose support is determined by the co-occurrence of words across documents. In min-Apriori, the association among attributes in a given row of the data matrix is obtained by taking the minimum value of the attributes. For example, the association between the words word1 and word2 in the document d1 is given by min(3,6) = 3. The support of an itemset is then computed by aggregating its association over all the documents:

s({word1, word2}) = min(3,6) + min(1,2) + min(4,2) + min(2,0) = 6.

The support measure defined in min-Apriori has the following desired properties, which makes it suitable for finding word associations in documents:

1. Support increases monotonically as the number of occurrences of a word increases.

2. Support increases monotonically as the number of documents that contain the word increases.

3. Support has an anti-monotone property. For example, consider a pair of itemsets {A, B} and {A, B, C}. Since min({A, B}) ≥ min({A, B, C}), it follows that s({A, B}) ≥ s({A, B, C}). Therefore, support decreases monotonically as the number of words in an itemset increases.

The standard Apriori algorithm can be modified to find associations among words using the new support definition.
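A minimal sketch of the min-Apriori support computation over the document-word matrix of Table 6.6 (the function name is illustrative, not from the text):

```python
# Rows of Table 6.6: document -> counts of word1..word6
docs = {
    "d1": [3, 6, 0, 0, 0, 2],
    "d2": [1, 2, 0, 0, 0, 2],
    "d3": [4, 2, 7, 0, 0, 2],
    "d4": [2, 0, 3, 0, 0, 1],
    "d5": [0, 0, 0, 1, 1, 0],
}
words = ["word1", "word2", "word3", "word4", "word5", "word6"]

def min_support(itemset):
    """min-Apriori support: sum over documents of the minimum count
    among the words in the itemset."""
    idx = [words.index(w) for w in itemset]
    return sum(min(row[i] for i in idx) for row in docs.values())

print(min_support(["word1", "word2"]))           # 3 + 1 + 2 + 0 + 0 = 6
print(min_support(["word1", "word2", "word6"]))  # anti-monotone: never exceeds 6
```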

6.3 Handling a Concept Hierarchy

A concept hierarchy is a multilevel organization of the various entities or concepts defined in a particular domain. For example, in market basket analysis, a concept hierarchy has the form of an item taxonomy describing the “is-a” relationships among items sold at a grocery store, e.g., milk is a kind of food and DVD is a kind of home electronics equipment (see Figure 6.2). Concept hierarchies are often defined according to domain knowledge or based on a standard classification scheme defined by certain organizations (e.g., the Library of Congress classification scheme is used to organize library materials based on their subject categories).

Figure 6.2. Example of an item taxonomy.

A concept hierarchy can be represented using a directed acyclic graph, as shown in Figure 6.2. If there is an edge in the graph from a node p to another node c, we call p the parent of c and c the child of p. For example, milk is the parent of skim milk because there is a directed edge from the node milk to the node skim milk. X̂ is called an ancestor of X (and X a descendent of X̂) if there is a path from node X̂ to node X in the directed acyclic graph. In the diagram shown in Figure 6.2, food is an ancestor of skim milk and AC adaptor is a descendent of electronics.

The main advantages of incorporating concept hierarchies into association analysis are as follows:

1. Items at the lower levels of a hierarchy may not have enough support to appear in any frequent itemset. For example, although the sale of AC adaptors and docking stations may be low, the sale of laptop accessories, which is their parent node in the concept hierarchy, may be high. Also, rules involving high-level categories may have lower confidence than the ones generated using low-level categories. Unless the concept hierarchy is used, there is a potential to miss interesting patterns at different levels of categories.

2. Rules found at the lower levels of a concept hierarchy tend to be overly specific and may not be as interesting as rules at the higher levels. For example, staple items such as milk and bread tend to produce many low-level rules such as {skim milk} → {wheat bread}, {2% milk} → {wheat bread}, and {skim milk} → {white bread}. Using a concept hierarchy, they can be summarized into a single rule, {milk} → {bread}. Considering only items residing at the top level of their hierarchies also may not be good enough, because such rules may not be of any practical use. For example, although the rule {electronics} → {food} may satisfy the support and confidence thresholds, it is not informative because the combination of electronics and food items that are frequently purchased by customers are unknown. If milk and batteries are the only items sold together frequently, then the pattern {food, electronics} may have overgeneralized the situation.

Standard association analysis can be extended to incorporate concept hierarchies in the following way. Each transaction t is initially replaced with its extended transaction t′, which contains all the items in t along with their corresponding ancestors. For example, the transaction {DVD, wheat bread} can be extended to {DVD, wheat bread, home electronics, electronics, bread, food}, where home electronics and electronics are the ancestors of DVD, while bread and food are the ancestors of wheat bread.
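A minimal sketch of the transaction-extension step, assuming a hypothetical child-to-parent map for a fragment of the taxonomy in Figure 6.2; the function names are illustrative, not from the text.

```python
# Hypothetical fragment of the item taxonomy: child -> parent
parent = {
    "skim milk": "milk", "2% milk": "milk",
    "milk": "food", "bread": "food",
    "DVD": "home electronics", "home electronics": "electronics",
}

def ancestors(item):
    """All ancestors of an item obtained by following child -> parent edges."""
    result = set()
    while item in parent:
        item = parent[item]
        result.add(item)
    return result

def extend(transaction):
    """Extended transaction t': the items of t plus all their ancestors."""
    ext = set(transaction)
    for item in transaction:
        ext |= ancestors(item)
    return ext

print(extend({"DVD", "skim milk"}))
# {'DVD', 'skim milk', 'home electronics', 'electronics', 'milk', 'food'} (set order may vary)
```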

We can then apply existing algorithms, such as Apriori, to the database of extended transactions. Although such an approach would find rules that span different levels of the concept hierarchy, it would suffer from several obvious limitations as described below:

1. Items residing at the higher levels tend to have higher support counts than those residing at the lower levels of a concept hierarchy. As a result, if the support threshold is set too high, then only patterns involving the high-level items are extracted. On the other hand, if the threshold is set too low, then the algorithm generates far too many patterns (most of which may be spurious) and becomes computationally inefficient.

2. Introduction of a concept hierarchy tends to increase the computation time of association analysis algorithms because of the larger number of items and wider transactions. The number of candidate patterns and frequent patterns generated by these algorithms may also grow exponentially with wider transactions.

3. Introduction of a concept hierarchy may produce redundant rules. A rule X → Y is redundant if there exists a more general rule X̂ → Ŷ, where X̂ is an ancestor of X, Ŷ is an ancestor of Y, and both rules have very similar confidence. For example, suppose rules such as {bread} → {2% milk} and {bread} → {skim milk} have confidence very similar to that of {bread} → {milk}. The rules involving items from the lower level of the hierarchy are considered redundant because they can be summarized by a rule involving the ancestor items. An itemset such as {skim milk, milk, food} is also redundant because food and milk are ancestors of skim milk. Fortunately, it is easy to eliminate such redundant itemsets during frequent itemset generation, given the knowledge of the hierarchy.

6.4 Sequential Patterns

Market basket data often contains temporal information about when an item was purchased by customers. Such information can be used to piece together the sequence of transactions made by a customer over a certain period of time. Similarly, event-based data collected from scientific experiments or the monitoring of physical systems, such as telecommunications networks, computer networks, and wireless sensor networks, have an inherent sequential nature to them. This means that an ordinal relation, usually based on temporal precedence, exists among events occurring in such data. However, the concepts of association patterns discussed so far emphasize only “co-occurrence” relationships and disregard the sequential information of the data. The latter information may be valuable for identifying recurring features of a dynamic system or predicting future occurrences of certain events. This section presents the basic concept of sequential patterns and the algorithms developed to discover them.

6.4.1 Preliminaries

The input to the problem of discovering sequential patterns is a sequence dataset, an example of which is shown on the left-hand side of Figure 6.3. Each row records the occurrences of events associated with a particular object at a given time. For example, the first row contains the set of events occurring at timestamp t = 10 for object A. Note that if we only consider the last column of this sequence dataset, it would look similar to a market basket data where every row represents a transaction containing a set of events (items). The traditional concept of association patterns in this data would correspond to common co-occurrences of events across transactions. However, a sequence dataset also contains information about the object and the timestamp of a transaction of events in the first two columns. These columns add context to every transaction, which enables a different style of association analysis for sequence datasets. The right-hand side of Figure 6.3 shows a different representation of the sequence dataset where the events associated with every object appear together, sorted in increasing order of their timestamps. In a sequence dataset, we can look for association patterns of events that commonly occur in a sequential order across objects. For example, in the sequence data shown in Figure 6.3, event 6 is followed by event 1 in all of the sequences. Note that such a pattern cannot be inferred if we treat this as a market basket data by ignoring information about the object and timestamp.

Figure 6.3. Example of a sequence database.

Before presenting a methodology for finding sequential patterns, we provide a brief description of sequences and subsequences.

Sequences

Generally speaking, a sequence is an ordered list of elements (transactions). A sequence can be denoted as s = ⟨e1 e2 e3 … en⟩, where each element ej is a collection of one or more events (items), i.e., ej = {i1, i2, …, ik}. The following is a list of examples of sequences:

Sequence of web pages viewed by a website visitor:
⟨{Homepage} {Electronics} {Cameras and Camcorders} {Digital Cameras} {Shopping Cart} {Order Confirmation} {Return to Shopping}⟩

Sequence of events leading to the nuclear accident at Three-Mile Island:
⟨{clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pump strip} {main water pump trips} {main turbine trips} {reactor pressure increases}⟩

Sequence of classes taken by a computer science major student in different semesters:
⟨{Algorithms and Data Structures, Introduction to Operating Systems} {Database Systems, Computer Architecture} {Computer Networks, Software Engineering} {Computer Graphics, Parallel Programming}⟩

A sequence can be characterized by its length and the number of occurring events. The length of a sequence corresponds to the number of elements present in the sequence, while we refer to a sequence that contains k events as a k-sequence. The web sequence in the previous example contains 7 elements and 7 events; the event sequence at Three-Mile Island contains 8 elements and 8 events; and the class sequence contains 4 elements and 8 events.

Figure 6.4 provides examples of sequences, elements, and events defined for a variety of application domains. Except for the last row, the ordinal attribute associated with each of the first three domains corresponds to calendar time. For the last row, the ordinal attribute corresponds to the location of the bases (A, C, G, T) in the gene sequence. Although the discussion on sequential patterns is primarily focused on temporal events, it can be extended to the case where the events have non-temporal ordering, such as spatial ordering.

Figure 6.4. Examples of elements and events in sequence datasets.

Subsequences

A sequence t is a subsequence of another sequence s if it is possible to derive t from s by simply deleting some events from elements in s or even deleting some elements in s completely. Formally, the sequence t = ⟨t1 t2 … tm⟩ is a subsequence of s = ⟨s1 s2 … sn⟩ if there exist integers 1 ≤ j1 < j2 < ⋯ < jm ≤ n such that t1 ⊆ sj1, t2 ⊆ sj2, …, tm ⊆ sjm. If t is a subsequence of s, then we say that t is contained in s. Table 6.7 gives examples illustrating the idea of subsequences for various sequences.

Table 6.7. Examples illustrating the concept of a subsequence.

Sequence, s              Sequence, t          Is t a subsequence of s?

⟨{2,4} {3,5,6} {8}⟩      ⟨{2} {3,6} {8}⟩      Yes
⟨{2,4} {3,5,6} {8}⟩      ⟨{2} {8}⟩            Yes
⟨{1,2} {3,4}⟩            ⟨{1} {2}⟩            No
⟨{2,4} {2,4} {2,5}⟩      ⟨{2} {4}⟩            Yes

6.4.2 Sequential Pattern Discovery

Let D be a dataset that contains one or more data sequences. The term data sequence refers to an ordered list of elements associated with a single data object. For example, the dataset shown in Figure 6.5 contains five data sequences, one for each object A, B, C, D, and E.

Figure 6.5. Sequential patterns derived from a dataset that contains five data sequences.

The support of a sequence s is the fraction of all data sequences that contain s. If the support for s is greater than or equal to a user-specified threshold minsup, then s is declared to be a sequential pattern (or frequent sequence).

Definition 6.1 (Sequential Pattern Discovery). Given a sequence dataset D and a user-specified minimum support threshold minsup, the task of sequential pattern discovery is to find all sequences with support ≥ minsup.

In Figure 6.5, the support for the sequence ⟨{1}{2}⟩ is equal to 80% because it is contained in four of the five data sequences (every object except for D). Assuming that the minimum support threshold is 50%, any sequence that is contained in at least three data sequences is considered to be a sequential pattern. Examples of sequential patterns extracted from the given dataset include ⟨{1}{2}⟩, ⟨{1,2}⟩, ⟨{2,3}⟩, ⟨{1,2}{2,3}⟩, etc.
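The containment test and the support of Definition 6.1 can be sketched as follows. The sequences in the example calls come from Table 6.7; the small sequence database at the end is hypothetical and is not the one shown in Figure 6.5.

```python
def is_subsequence(t, s):
    """True if sequence t (a list of sets) is contained in sequence s:
    the elements of t map, in order, to distinct elements of s that
    are supersets of them (Section 6.4.1)."""
    j = 0
    for element in t:
        while j < len(s) and not element <= s[j]:
            j += 1
        if j == len(s):
            return False
        j += 1
    return True

def support(t, dataset):
    """Fraction of data sequences in the dataset that contain t."""
    return sum(is_subsequence(t, s) for s in dataset) / len(dataset)

# Examples from Table 6.7
print(is_subsequence([{2}, {3, 6}, {8}], [{2, 4}, {3, 5, 6}, {8}]))  # True
print(is_subsequence([{1}, {2}], [{1, 2}, {3, 4}]))                  # False: needs two distinct elements

# Hypothetical sequence database (not the one in Figure 6.5)
db = [[{6}, {1}], [{6}, {2}, {1}], [{2}, {6}]]
print(support([{6}, {1}], db))   # 0.666... (2 of the 3 sequences contain it)
```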

Sequentialpatterndiscoveryisacomputationallychallengingtaskbecausethesetofallpossiblesequencesthatcanbegeneratedfromacollectionofeventsisexponentiallylargeanddifficulttoenumerate.Forexample,acollectionofneventscanresultinthefollowingexamplesof1-sequences,2-sequences,and3-sequences:

1-sequences:

2-sequences:

3-sequences:

TheaboveenumerationissimilarinsomewaystotheitemsetlatticeintroducedinChapter5 formarketbasketdata.However,notethattheaboveenumerationisnotexhaustive;itonlyshowssomesequencesandomitsalargenumberofremainingonesbytheuseofellipses(…).Thisisbecausethenumberofcandidatesequencesissubstantiallylargerthanthenumberofcandidateitemsets,whichmakestheirenumerationdifficult.Therearethreereasonsfortheadditionalnumberofcandidatessequences:

1. An item can appear at most once in an itemset, but an event can appear more than once in a sequence, in different elements of the sequence. For example, given a pair of items, i1 and i2, only one candidate 2-itemset, {i1,i2}, can be generated. In contrast, there are many candidate 2-sequences that can be generated using only two events: ⟨{i1}{i1}⟩, ⟨{i1}{i2}⟩, ⟨{i2}{i1}⟩, ⟨{i2}{i2}⟩, and ⟨{i1,i2}⟩.

2. Order matters in sequences, but not for itemsets. For example, {i1,i2} and {i2,i1} refer to the same itemset, whereas ⟨{i1}{i2}⟩, ⟨{i2}{i1}⟩, and ⟨{i1,i2}⟩ correspond to different sequences, and thus must be generated separately.

3. For market basket data, the number of distinct items n puts an upper bound on the number of candidate itemsets (2^n − 1), whereas for sequence data, even two events a and b can lead to infinitely many candidate sequences (see Figure 6.6 for an illustration).

Figure 6.6. Comparing the number of itemsets with the number of sequences generated using two events (items). We only show 1-sequences, 2-sequences, and 3-sequences for illustration.

Because of the above reasons, it is challenging to create a sequence lattice that enumerates all possible sequences even when the number of events in the data is small. It is thus difficult to use a brute-force approach for generating sequential patterns that enumerates all possible sequences by traversing the sequence lattice. Despite these challenges, the Apriori principle still holds for sequential data because any data sequence that contains a particular k-sequence must also contain all of its (k−1)-subsequences. As we will see later, even though it is challenging to construct the sequence lattice, it is possible to generate candidate k-sequences from the frequent (k−1)-sequences using the Apriori principle. This allows us to extract sequential patterns from a sequence data set using an Apriori-like algorithm. The basic structure of this algorithm is shown in Algorithm 6.1.

Algorithm 6.1 Apriori-like algorithm for sequential pattern discovery.

1: k = 1.
2: Fk = {i | i ∈ I ∧ σ({i})/N ≥ minsup}. {Find all frequent 1-subsequences.}
3: repeat
4:   k = k + 1.
5:   Ck = candidate-gen(Fk−1). {Generate candidate k-subsequences.}
6:   Ck = candidate-prune(Ck, Fk−1). {Prune candidate k-subsequences.}
7:   for each data sequence t ∈ T do
8:     Ct = subsequence(Ck, t). {Identify all candidates contained in t.}
9:     for each candidate k-subsequence c ∈ Ct do
10:      σ(c) = σ(c) + 1. {Increment the support count.}
11:    end for
12:  end for
13:  Fk = {c | c ∈ Ck ∧ σ(c)/N ≥ minsup}. {Extract the frequent k-subsequences.}
14: until Fk = ∅
15: Answer = ∪ Fk.

Notice that the structure of the algorithm is almost identical to the Apriori algorithm for frequent itemset discovery presented in the previous chapter. The algorithm iteratively generates new candidate k-sequences, prunes candidates whose (k−1)-sequences are infrequent, and then counts the supports of the remaining candidates to identify the sequential patterns. The detailed aspects of these steps are given next.
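The control flow of Algorithm 6.1 maps directly to code. The sketch below is a minimal skeleton under our own assumptions: candidate_gen and candidate_prune are caller-supplied stubs implementing the merging and pruning steps described next, and is_subsequence is the containment test sketched earlier.

```python
def apriori_sequential(data_sequences, minsup, candidate_gen, candidate_prune):
    """Skeleton of Algorithm 6.1 (a sketch, not the book's reference code)."""
    n = len(data_sequences)
    # Level 1: frequent 1-sequences.
    events = {e for seq in data_sequences for elem in seq for e in elem}
    current = [[{e}] for e in events
               if sum(is_subsequence([{e}], s) for s in data_sequences) / n >= minsup]
    frequent = list(current)
    while current:
        candidates = candidate_prune(candidate_gen(current), current)
        counts = [0] * len(candidates)
        for s in data_sequences:                      # one pass per level
            for i, c in enumerate(candidates):
                if is_subsequence(c, s):
                    counts[i] += 1
        current = [c for i, c in enumerate(candidates) if counts[i] / n >= minsup]
        frequent.extend(current)
    return frequent
```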

Candidate Generation

We generate candidate k-sequences by merging a pair of frequent (k−1)-sequences. Although this approach is similar to the Fk−1 × Fk−1 strategy introduced in Chapter 5 for generating candidate itemsets, there are certain differences. First, in the case of generating sequences, we can merge a (k−1)-sequence with itself to produce a k-sequence, since an event can appear more than once in a sequence. For example, we can merge the 1-sequence ⟨{a}⟩ with itself to produce a candidate 2-sequence, ⟨{a}{a}⟩. Second, recall that in order to avoid generating duplicate candidates, the traditional Apriori algorithm merges a pair of frequent k-itemsets only if their first k−1 items, arranged in lexicographic order, are identical. In the case of generating sequences, we still use the lexicographic order for arranging events within an element. However, the arrangement of elements in a sequence may not follow the lexicographic order. For example, ⟨{b,c}{a}{d}⟩ is a viable representation of a 4-sequence, even though the elements in the sequence are not arranged according to their lexicographic ranks. On the other hand, ⟨{c,b}{a}{d}⟩ is not a viable representation of the same 4-sequence, since the events in the first element violate the lexicographic order.


Given a sequence s = ⟨e1 e2 e3 … en⟩, where the events in every element are arranged lexicographically, we can refer to the first event of e1 as the first event of s and the last event of en as the last event of s. The criteria for merging sequences can then be stated in the form of the following procedure.

Sequence Merging Procedure

A sequence s(1) is merged with another sequence s(2) only if the subsequence obtained by dropping the first event in s(1) is identical to the subsequence obtained by dropping the last event in s(2). The resulting candidate is given by extending the sequence s(1) as follows:

1. If the last element of s(2) has only one event, append the last element of s(2) to the end of s(1) and obtain the merged sequence.

2. If the last element of s(2) has more than one event, append the last event from the last element of s(2) (that is absent in the last element of s(1)) to the last element of s(1) and obtain the merged sequence.
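The merging criterion and its two cases can be expressed compactly in code. The sketch below is our own illustration (sequences are lists of sorted tuples of events); it covers the general case described in the procedure and does not handle the special merging of two 1-sequences.

```python
def drop_first_event(seq):
    """Remove the first event of the first element (drop the element if it empties)."""
    head = sorted(seq[0])
    rest = tuple(head[1:])
    return ([rest] if rest else []) + list(seq[1:])

def drop_last_event(seq):
    """Remove the last event of the last element (drop the element if it empties)."""
    tail = sorted(seq[-1])
    rest = tuple(tail[:-1])
    return list(seq[:-1]) + ([rest] if rest else [])

def merge_sequences(s1, s2):
    """Merge two frequent (k-1)-sequences into a candidate k-sequence, or return None.

    Sequences are lists of tuples, e.g. [(1,), (5,), (3,)] stands for <{1}{5}{3}>.
    """
    if drop_first_event(s1) != drop_last_event(s2):
        return None                            # merging criterion not met
    last = sorted(s2[-1])
    if len(last) == 1:                         # case 1: append s2's whole last element
        return list(s1) + [tuple(last)]
    # case 2: append the last event of s2's last element to s1's last element
    return list(s1[:-1]) + [tuple(sorted(set(s1[-1]) | {last[-1]}))]

# The two merges illustrated in Figure 6.7:
print(merge_sequences([(1,), (2,), (3,)], [(2,), (3,), (4,)]))   # [(1,), (2,), (3,), (4,)]
print(merge_sequences([(1,), (5,), (3,)], [(5,), (3, 4)]))       # [(1,), (5,), (3, 4)]
```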

Figure 6.7 illustrates examples of candidate 4-sequences obtained by merging pairs of frequent 3-sequences. The first candidate, ⟨{1}{2}{3}{4}⟩, is obtained by merging ⟨{1}{2}{3}⟩ with ⟨{2}{3}{4}⟩. Since the last element of the second sequence ({4}) contains only one event, it is simply appended to the first sequence to generate the merged sequence. On the other hand, merging ⟨{1}{5}{3}⟩ with ⟨{5}{3,4}⟩ produces the candidate 4-sequence ⟨{1}{5}{3,4}⟩. In this case, the last element of the second sequence ({3,4}) contains two events. Hence, the last event in this element (4) is added to the last element of the first sequence ({3}) to obtain the merged sequence.

Figure 6.7. Example of the candidate generation and pruning steps of a sequential pattern mining algorithm.

It can be shown that the sequence merging procedure is complete, i.e., it generates every frequent k-sequence. This is because every frequent k-sequence s includes a frequent (k−1)-sequence s1 that does not contain the first event of s, and a frequent (k−1)-sequence s2 that does not contain the last event of s. Since s1 and s2 are frequent and follow the criteria for merging sequences, they will be merged to produce every frequent k-sequence as one of the candidates. Also, the sequence merging procedure ensures that there is a unique way of generating s, only by merging s1 and s2. For example, in Figure 6.7, the sequences ⟨{1}{2}{3}⟩ and ⟨{1}{2,5}⟩ do not have to be merged because removing the first event from the first sequence does not give the same subsequence as removing the last event from the second sequence. Although ⟨{1}{2,5}{3}⟩ is a viable candidate, it is generated by merging a different pair of sequences, ⟨{1}{2,5}⟩ and ⟨{2,5}{3}⟩. This example illustrates that the sequence merging procedure does not generate duplicate candidate sequences.

Candidate Pruning

A candidate k-sequence is pruned if at least one of its (k−1)-sequences is infrequent. For example, consider the candidate 4-sequence ⟨{1}{2}{3}{4}⟩. We need to check if any of the 3-sequences contained in this 4-sequence is infrequent. Since the sequence ⟨{1}{2}{4}⟩ is contained in this sequence and is infrequent, the candidate ⟨{1}{2}{3}{4}⟩ can be eliminated. Readers should be able to verify that the only candidate 4-sequence that survives the candidate pruning step in Figure 6.7 is ⟨{1}{2,5}{3}⟩.

Support Counting

During support counting, the algorithm identifies all candidate k-sequences belonging to a particular data sequence and increments their support counts. After performing this step for each data sequence, the algorithm identifies the frequent k-sequences and discards all candidate sequences whose support values are less than the minsup threshold.

6.4.3 Timing Constraints*

This section presents a sequential pattern formulation where timing constraints are imposed on the events and elements of a pattern. To motivate the need for timing constraints, consider the following sequences of courses taken by two students who enrolled in a data mining class:

Student A: ⟨{Statistics} {Database Systems} {Data Mining}⟩.
Student B: ⟨{Database Systems} {Statistics} {Data Mining}⟩.

The sequential pattern of interest is ⟨{Statistics, Database Systems} {Data Mining}⟩, which means that students who are enrolled in the data mining class must have previously taken a course in statistics and database systems. Clearly, the pattern is supported by both students even though they do not take statistics and database systems at the same time. In contrast, a student who took a statistics course ten years earlier should not be considered as supporting the pattern because the time gap between the courses is too long. Because the formulation presented in the previous section does not incorporate these timing constraints, a new sequential pattern definition is needed.

Figure 6.8 illustrates some of the timing constraints that can be imposed on a pattern. The definition of these constraints and the impact they have on sequential pattern discovery algorithms will be discussed in the following sections. Note that each element of the sequential pattern is associated with a time window [l, u], where l is the earliest occurrence of an event within the time window and u is the latest occurrence of an event within the time window. Note that in this discussion, we allow events within an element to occur at different times. Hence, the actual timing of the event occurrences may not be the same as the lexicographic ordering.

Figure 6.8. Timing constraints of a sequential pattern.

The maxspan Constraint The maxspan constraint specifies the maximum allowed time difference between the latest and the earliest occurrences of events in the entire sequence. For example, suppose the following data sequences contain elements that occur at consecutive timestamps (1, 2, 3, …), i.e., the ith element in the sequence occurs at the ith timestamp. Assuming that maxspan = 3, the following table contains sequential patterns that are supported and not supported by a given data sequence.

Data Sequence, s                         Sequential Pattern, t     Does s support t?
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩          ⟨{3} {4}⟩                 Yes
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩          ⟨{3} {6}⟩                 Yes
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩          ⟨{1,3} {6}⟩               No

In general, the longer the maxspan, the more likely it is to detect a pattern in a data sequence. However, a longer maxspan can also capture spurious patterns because it increases the chance for two unrelated events to be temporally related. In addition, the pattern may involve events that are already obsolete.

The maxspan constraint affects the support counting step of sequential pattern discovery algorithms. As shown in the preceding examples, some data sequences no longer support a candidate pattern when the maxspan constraint is imposed. If we simply apply Algorithm 6.1, the support counts for some patterns may be overestimated. To avoid this problem, the algorithm must be modified to ignore cases where the interval between the first and last occurrences of events in a given pattern is greater than maxspan.

The mingap and maxgap Constraints Timing constraints can also be specified to restrict the time difference between two consecutive elements of a sequence. If the maximum time difference (maxgap) is one week, then events in one element must occur within a week's time of the events occurring in the previous element. If the minimum time difference (mingap) is zero, then events in one element must occur after the events occurring in the previous element. (See Figure 6.8.) The following table shows examples of patterns that pass or fail the maxgap and mingap constraints, assuming that maxgap = 3 and mingap = 1. These examples assume each element occurs at consecutive time steps.

Data Sequence, s                         Sequential Pattern, t     maxgap    mingap
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩          ⟨{3} {6}⟩                 Pass      Pass
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩          ⟨{6} {8}⟩                 Pass      Fail
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩          ⟨{1,3} {6}⟩               Fail      Pass
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩          ⟨{1} {3} {8}⟩             Fail      Fail

As with maxspan, these constraints will affect the support counting step of sequential pattern discovery algorithms because some data sequences no longer support a candidate pattern when mingap and maxgap constraints are present. These algorithms must be modified to ensure that the timing constraints are not violated when counting the support of a pattern. Otherwise, some infrequent sequences may mistakenly be declared as frequent patterns.

A side effect of using the maxgap constraint is that the Apriori principle might be violated. To illustrate this, consider the data set shown in Figure 6.5. Without mingap or maxgap constraints, the support for ⟨{2}{5}⟩ and ⟨{2}{3}{5}⟩ are both equal to 60%. However, if mingap = 0 and maxgap = 1, then the support for ⟨{2}{5}⟩ reduces to 40%, while the support for ⟨{2}{3}{5}⟩ is still 60%. In other words, support has increased when the number of events in a sequence increases, which contradicts the Apriori principle. The violation occurs because the object D does not support the pattern ⟨{2}{5}⟩ since the time gap between events 2 and 5 is greater than maxgap. This problem can be avoided by using the concept of a contiguous subsequence.

Definition 6.2 (Contiguous Subsequence). A sequence s is a contiguous subsequence of w = ⟨e1 e2 … ek⟩ if any one of the following conditions hold:

1. s is obtained from w after deleting an event from either e1 or ek,
2. s is obtained from w after deleting an event from any element ei ∈ w that contains at least two events, or
3. s is a contiguous subsequence of t and t is a contiguous subsequence of w.

The following examples illustrate the concept of a contiguous subsequence:

Data Sequence, s               Sequential Pattern, t     Is t a contiguous subsequence of s?
⟨{1} {2,3}⟩                    ⟨{1} {2}⟩                 Yes
⟨{1,2} {2} {3}⟩                ⟨{1} {2}⟩                 Yes
⟨{3,4} {1,2} {2,3} {4}⟩        ⟨{1} {2}⟩                 Yes
⟨{1} {3} {2}⟩                  ⟨{1} {2}⟩                 No
⟨{1,2} {1} {3} {2}⟩            ⟨{1} {2}⟩                 No

Using the concept of contiguous subsequences, the Apriori principle can be modified to handle maxgap constraints in the following way.

Definition 6.3 (Modified Apriori Principle). If a k-sequence is frequent, then all of its contiguous (k−1)-subsequences must also be frequent.

The modified Apriori principle can be applied to the sequential pattern discovery algorithm with minor modifications. During candidate pruning, not all k-sequences need to be verified since some of them may violate the maxgap constraint. For example, if maxgap = 1, it is not necessary to check whether the subsequence ⟨{1}{2,3}{5}⟩ of the candidate ⟨{1}{2,3}{4}{5}⟩ is frequent since the time difference between elements {2,3} and {5} is greater than one time unit. Instead, only the contiguous subsequences of ⟨{1}{2,3}{4}{5}⟩ need to be examined. These subsequences include ⟨{1}{2,3}{4}⟩, ⟨{2,3}{4}{5}⟩, ⟨{1}{2}{4}{5}⟩, and ⟨{1}{3}{4}{5}⟩.

The Window Size Constraint Finally, events within an element sj do not have to occur at the same time. A window size threshold (ws) can be defined to specify the maximum allowed time difference between the latest and earliest occurrences of events in any element of a sequential pattern. A window size of 0 means all events in the same element of a pattern must occur simultaneously.

The following example uses ws = 2 to determine whether a data sequence supports a given sequence (assuming mingap = 0, maxgap = 3, and maxspan = ∞).

Data Sequence, s                         Sequential Pattern, t     Does s support t?
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩          ⟨{3,4} {5}⟩               Yes
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩          ⟨{4,6} {8}⟩               Yes
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩          ⟨{3,4,6} {8}⟩             No
⟨{1,3} {3,4} {4} {5} {6,7} {8}⟩          ⟨{1,3,4} {6,7,8}⟩         No


In the last example, although the pattern ⟨{1,3,4}{6,7,8}⟩ satisfies the window size constraint, it violates the maxgap constraint because the maximum time difference between events in the two elements is 5 units. The window size constraint also affects the support counting step of sequential pattern discovery algorithms. If Algorithm 6.1 is applied without imposing the window size constraint, the support counts for some of the candidate patterns might be underestimated, and thus some interesting patterns may be lost.
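A constrained containment check can be sketched as a small backtracking matcher. The timing conventions coded below (ws within an element, mingap and maxgap between consecutive elements, maxspan over the whole match) are one reasonable reading of Figure 6.8 and the examples in this section, stated as an assumption rather than taken verbatim from the text; the function name and data layout are likewise our own.

```python
from itertools import product

def supports(data, pattern, mingap=0, maxgap=float("inf"),
             maxspan=float("inf"), ws=0):
    """data: list of (timestamp, set_of_events); pattern: list of sets.

    Assumed conventions: u(i) - l(i) <= ws inside an element,
    l(i+1) - u(i) > mingap, u(i+1) - l(i) <= maxgap between consecutive
    elements, and u(last) - l(first) <= maxspan overall.
    """
    def windows(element):
        # All (l, u) time windows that can cover every event of the element.
        times = {e: [t for t, events in data if e in events] for e in element}
        if any(not ts for ts in times.values()):
            return []
        found = set()
        for combo in product(*times.values()):      # one timestamp per event
            l, u = min(combo), max(combo)
            if u - l <= ws:
                found.add((l, u))
        return sorted(found)

    def search(i, prev_l, prev_u, first_l):
        if i == len(pattern):
            return True
        for l, u in windows(pattern[i]):
            if i > 0 and (l - prev_u <= mingap or u - prev_l > maxgap):
                continue
            if first_l is not None and u - first_l > maxspan:
                continue
            if search(i + 1, l, u, l if first_l is None else first_l):
                return True
        return False

    return search(0, None, None, None)

# The window-size examples from the table above (ws = 2, mingap = 0, maxgap = 3):
s = [(1, {1, 3}), (2, {3, 4}), (3, {4}), (4, {5}), (5, {6, 7}), (6, {8})]
print(supports(s, [{3, 4}, {5}], ws=2, maxgap=3))           # True
print(supports(s, [{4, 6}, {8}], ws=2, maxgap=3))           # True
print(supports(s, [{3, 4, 6}, {8}], ws=2, maxgap=3))        # False
print(supports(s, [{1, 3, 4}, {6, 7, 8}], ws=2, maxgap=3))  # False
```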

6.4.4 Alternative Counting Schemes*

There are multiple ways of defining the support of a sequence given a data sequence. For example, if our database involves long sequences of events, we might be interested in finding subsequences that occur multiple times in the same data sequence. Hence, instead of counting the support of a subsequence as the number of data sequences it is contained in, we can also take into account the number of times a subsequence is contained in a data sequence. This viewpoint gives rise to several different formulations for counting the support of a candidate k-sequence from a database of sequences. For illustrative purposes, consider the problem of counting the support for the sequence ⟨{p}{q}⟩, as shown in Figure 6.9. Assume that ws = 0, mingap = 0, maxgap = 2, and maxspan = 2.

Figure 6.9. Comparing different support counting methods.

COBJ: One occurrence per object.

This method looks for at least one occurrence of a given sequence in an object's timeline. In Figure 6.9, even though the sequence ⟨{p}{q}⟩ appears several times in the object's timeline, it is counted only once, with p occurring at t = 1 and q occurring at t = 3.

CWIN: One occurrence per sliding window.

In this approach, a sliding time window of fixed length (maxspan) is moved across an object's timeline, one unit at a time. The support count is incremented each time the sequence is encountered in the sliding window. In Figure 6.9, the sequence ⟨{p}{q}⟩ is observed six times using this method.

CMINWIN: Number of minimal windows of occurrence.

A minimal window of occurrence is the smallest window in which the sequence occurs given the timing constraints. In other words, a minimal window is the time interval such that the sequence occurs in that time interval, but it does not occur in any of the proper subintervals of it. This definition can be considered as a restrictive version of CWIN, because its effect is to shrink and collapse some of the windows that are counted by CWIN. For example, the sequence ⟨{p}{q}⟩ has four minimal window occurrences: (1) the pair (p: t = 2, q: t = 3), (2) the pair (p: t = 3, q: t = 4), (3) the pair (p: t = 5, q: t = 6), and (4) the pair (p: t = 6, q: t = 7). The occurrence of event p at t = 1 and event q at t = 3 is not a minimal window occurrence because it contains a smaller window with (p: t = 2, q: t = 3), which is indeed a minimal window of occurrence.

CDIST_O: Distinct occurrences with possibility of event-timestamp overlap.

A distinct occurrence of a sequence is defined to be the set of event-timestamp pairs such that there has to be at least one new event-timestamp pair that is different from a previously counted occurrence. Counting all such distinct occurrences results in the CDIST_O method. If the occurrence time of events p and q is denoted as a tuple (t(p), t(q)), then this method yields eight distinct occurrences of sequence ⟨{p}{q}⟩ at times (1,3), (2,3), (2,4), (3,4), (3,5), (5,6), (5,7), and (6,7).

CDIST: Distinct occurrences with no event-timestamp overlap allowed.


In CDIST_O above, two occurrences of a sequence were allowed to have overlapping event-timestamp pairs, e.g., the overlap between (1,3) and (2,3). In the CDIST method, no overlap is allowed. Effectively, when an event-timestamp pair is considered for counting, it is marked as used and is never used again for subsequent counting of the same sequence. As an example, there are five distinct, non-overlapping occurrences of the sequence ⟨{p}{q}⟩ in the diagram shown in Figure 6.9. These occurrences happen at times (1,3), (2,4), (3,5), (5,6), and (6,7). Observe that these occurrences are subsets of the occurrences observed in CDIST_O.
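The occurrence times of p and q implied by the lists above (p at t = 1, 2, 3, 5, 6 and q at t = 3, 4, 5, 6, 7) are enough to reproduce several of the counts in code. The sketch below is our own illustration; in particular, sliding windows starting at every timestamp and a greedy left-to-right rule for CDIST are assumptions that happen to match the counts reported in the text.

```python
p_times = [1, 2, 3, 5, 6]          # read off Figure 6.9 via the occurrence lists
q_times = [3, 4, 5, 6, 7]
maxspan, mingap = 2, 0

# All occurrences of <{p}{q}> satisfying the timing constraints.
occ = [(tp, tq) for tp in p_times for tq in q_times
       if mingap < tq - tp <= maxspan]

cdist_o = len(occ)                                          # 8

used_p, used_q, cdist = set(), set(), 0                     # CDIST: no reuse of pairs
for tp, tq in sorted(occ):
    if tp not in used_p and tq not in used_q:
        used_p.add(tp); used_q.add(tq); cdist += 1          # 5

lo, hi = min(p_times + q_times), max(p_times + q_times)     # CWIN
cwin = sum(any(start <= tp and tq <= start + maxspan for tp, tq in occ)
           for start in range(lo, hi + 1))                  # 6

print(cdist_o, cdist, cwin)
```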

One final point regarding the counting methods is the need to determine the baseline for computing the support measure. For frequent itemset mining, the baseline is given by the total number of transactions. For sequential pattern mining, the baseline depends on the counting method used. For the COBJ method, the total number of objects in the input data can be used as the baseline. For the CWIN and CMINWIN methods, the baseline is given by the sum of the number of time windows possible in all objects. For methods such as CDIST and CDIST_O, the baseline is given by the sum of the number of distinct timestamps present in the input data of each object.


6.5 Subgraph Patterns

This section describes the application of association analysis methods to graphs, which are more complex entities than itemsets and sequences. A number of entities such as chemical compounds, 3-D protein structures, computer networks, and tree-structured XML documents can be modeled using a graph representation, as shown in Table 6.8.

Table 6.8. Graph representation of entities in various application domains.

Application               Graphs                                  Vertices                Edges
Web mining                Collection of inter-linked Web pages    Web pages               Hyperlink between pages
Computational chemistry   Chemical compounds                      Atoms or ions           Bond between atoms or ions
Computer security         Computer networks                       Computers and servers   Interconnection between machines
Semantic Web              XML documents                           XML elements            Parent-child relationship between elements
Bioinformatics            3-D Protein structures                  Amino acids             Contact residue

A useful data mining task to perform on this type of data is to derive a set of frequently occurring substructures in a collection of graphs. Such a task is known as frequent subgraph mining. A potential application of frequent subgraph mining can be seen in the context of computational chemistry. Each year, new chemical compounds are designed for the development of pharmaceutical drugs, pesticides, fertilizers, etc. Although the structure of a compound is known to play a major role in determining its chemical properties, it is difficult to establish their exact relationship. Frequent subgraph mining can aid this undertaking by identifying the substructures commonly associated with certain properties of known compounds. Such information can help scientists to develop new chemical compounds that have certain desired properties.

This section presents a methodology for applying association analysis to graph-based data. The section begins with a review of some of the basic graph-related concepts and definitions. The frequent subgraph mining problem is then introduced, followed by a description of how the traditional Apriori algorithm can be extended to discover such patterns.

6.5.1 Preliminaries

Graphs A graph is a data structure that can be used to represent relationships among a set of entities. Mathematically, a graph G = (V, E) is composed of a vertex set V and a set of edges E connecting pairs of vertices. Each edge is denoted by a vertex pair (vi, vj), where vi, vj ∈ V. A label l(vi) can be assigned to each vertex vi representing the name of an entity. Similarly, each edge (vi, vj) can also be associated with a label l(vi, vj) describing the relationship between a pair of entities. Table 6.8 shows the vertices and edges associated with different types of graphs. For example, in a web graph, the vertices correspond to web pages and the edges represent the hyperlinks between web pages.

Although the size of a graph can generally be represented either by the number of its vertices or its edges, in this chapter, we will consider the size of a graph as its number of edges. Further, we will denote a graph with k edges as a k-graph.

Graph Isomorphism A basic primitive that is needed to work with graphs is to decide if two graphs with the same number of vertices and edges are equivalent to each other, i.e., represent the same structure of relationships among entities. Graph isomorphism provides a formal definition of graph equivalence that serves as a building block for computing similarities among graphs.

Definition 6.4 (Graph Isomorphism). Two graphs G1 = (V1, E1) and G2 = (V2, E2) are isomorphic to each other (denoted as G1 ≃ G2) if there exist functions fv: V1 → V2 and fe: E1 → E2 that map every vertex and edge, respectively, from G1 to G2, such that the following properties are satisfied:

1. Edge-preserving property: Two vertices va and vb form an edge in G1 if and only if the vertices fv(va) and fv(vb) form an edge in G2.

2. Label-preserving property: The labels of two vertices va and vb in G1 are equal if and only if the labels of fv(va) and fv(vb) in G2 are equal. Similarly, the labels of two edges (va, vb) and (vc, vd) in G1 are equal if and only if the labels of fe(va, vb) and fe(vc, vd) are equal.

The mapping functions (fv, fe) constitute the isomorphism between the graphs G1 and G2. This is denoted as (fv, fe): G1 → G2. An automorphism is a special type of isomorphism where a graph is mapped onto itself, i.e., V1 = V2 and E1 = E2. Figure 6.10 shows an example of a graph automorphism where the set of vertex labels in both graphs is {A, B}. Even though both graphs look different, they are actually isomorphic to each other because there is a one-to-one mapping between the vertices and edges of both graphs. Since the same graph can be depicted in multiple forms, detecting graph automorphism is a non-trivial problem. A common approach to solving this problem is to assign a canonical label to every graph, such that every automorphism of a graph shares the same canonical label. Canonical labels can also help in arranging graphs in a particular (canonical) order and checking for duplicates. Techniques for constructing canonical labels are not covered in this chapter, but interested readers may consult the Bibliographic Notes at the end of this chapter for more details.

Figure 6.10. Graph isomorphism.
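For the small labeled graphs used in this chapter, isomorphism can be checked by brute force over vertex mappings. The sketch below is our own illustration (not the canonical-labeling approach mentioned above, and not meant for large graphs); the two example graphs are a plausible stand-in for the two drawings in Figure 6.10, not its exact content.

```python
from itertools import permutations

def are_isomorphic(g1, g2):
    """Brute-force labeled-graph isomorphism test.

    A graph is (vertex_labels, edges): vertex_labels maps a vertex id to its
    label; edges maps frozenset({u, v}) to an edge label.
    """
    v1, e1 = g1
    v2, e2 = g2
    if len(v1) != len(v2) or len(e1) != len(e2):
        return False
    vs1, vs2 = list(v1), list(v2)
    for perm in permutations(vs2):
        fv = dict(zip(vs1, perm))                        # candidate vertex mapping
        if any(v1[v] != v2[fv[v]] for v in vs1):
            continue                                     # vertex labels must match
        mapped = {frozenset({fv[u], fv[v]}): lbl
                  for (u, v), lbl in ((tuple(e), l) for e, l in e1.items())}
        if mapped == e2:                                 # edges and edge labels match
            return True
    return False

# Two drawings of the same 4-cycle whose vertex labels alternate A, B, A, B.
g1 = ({1: "A", 2: "B", 3: "A", 4: "B"},
      {frozenset({1, 2}): "e", frozenset({2, 3}): "e",
       frozenset({3, 4}): "e", frozenset({4, 1}): "e"})
g2 = ({1: "A", 2: "A", 3: "B", 4: "B"},
      {frozenset({1, 3}): "e", frozenset({3, 2}): "e",
       frozenset({2, 4}): "e", frozenset({4, 1}): "e"})
print(are_isomorphic(g1, g2))   # True
```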

Subgraphs

Definition 6.5 (Subgraph). A graph G′ = (V′, E′) is a subgraph of another graph G = (V, E) if its vertex set V′ is a subset of V and its edge set E′ is a subset of E, such that the endpoints of every edge in E′ are contained in V′. The subgraph relationship is denoted as G′ ⊆S G.

Example 6.2. Figure 6.11 shows a graph that contains 6 vertices and 11 edges along with one of its possible subgraphs. The subgraph, which is shown in Figure 6.11(b), contains only 4 of the 6 vertices and 4 of the 11 edges in the original graph.


Figure 6.11. Example of a subgraph.

Definition 6.6 (Support). Given a collection of graphs G, the support for a subgraph g is defined as the fraction of all graphs that contain g as its subgraph, i.e.,

s(g) = |{Gi | g ⊆S Gi, Gi ∈ G}| / |G|.   (6.2)

Example 6.3. Consider the five graphs, G1 through G5, shown in Figure 6.12, where the set of vertex labels ranges from a to e but all the edges in the graphs have the same label. The graph g1 shown on the top right-hand diagram is a subgraph of G1, G3, G4, and G5. Therefore, s(g1) = 4/5 = 80%.

Similarly, we can show that s(g2) = 60% because g2 is a subgraph of G1, G2, and G3, while s(g3) = 40% because g3 is a subgraph of only two of the five graphs.

Figure 6.12. Computing the support of a subgraph from a set of graphs.

6.5.2 Frequent Subgraph Mining

This section presents a formal definition of the frequent subgraph mining problem and illustrates the complexity of this task.

Definition 6.7 (Frequent Subgraph Mining). Given a set of graphs G and a support threshold, minsup, the goal of frequent subgraph mining is to find all subgraphs g such that s(g) ≥ minsup.

While this formulation is generally applicable to any type of graph, the discussion presented in this chapter focuses primarily on undirected, connected graphs. The definitions of these graphs are given below:

1. A graph is undirected if it contains only undirected edges. An edge (vi, vj) is undirected if it is indistinguishable from (vj, vi).

2. A graph is connected if there exists a path between every pair of vertices in the graph, in which a path is a sequence of vertices ⟨v1 v2 … vk⟩ such that there is an edge connecting every pair of adjacent vertices (vi, vi+1) in the sequence.

Methods for handling other types of subgraphs (directed or disconnected) are left as an exercise to the readers (see Exercise 15 on page 519).

Mining frequent subgraphs is a computationally expensive task that is much more challenging than mining frequent itemsets or frequent subsequences. The additional complexity in frequent subgraph mining arises due to two major reasons. First, computing the support of a subgraph g given a collection of graphs G is not as straightforward as for itemsets or sequences. This is because it is a non-trivial problem to check if a subgraph g is contained in a graph g′ ∈ G, since the same graph g can be present in a different form in g′ due to graph isomorphism. The problem of verifying if a graph is a subgraph of another graph is known as the subgraph isomorphism problem, which is proven to be NP-complete, i.e., there is no known algorithm for this problem that runs in polynomial time.

Second, the number of candidate subgraphs that can be generated from a given set of vertex and edge labels is far larger than the number of candidate itemsets generated using traditional market basket data sets. This is because of the following reasons:

1. A collection of items forms a unique itemset, but the same set of edge labels can be arranged in an exponential number of ways in a graph, with multiple choices of vertex labels at their endpoints. For example, items p, q, and r form a unique itemset {p,q,r}, but three edges with labels p, q, and r can form multiple graphs, some examples of which are shown in Figure 6.13.

Figure 6.13. Examples of graphs generated using three edges with labels p, q, and r.

2. An item can appear at most once in an itemset but an edge label can appear multiple times in a graph, because different arrangements of edges with the same edge label represent different graphs. For example, an item p can only generate a single candidate itemset, which is the item itself. However, using a single edge label p and vertex label a, we can generate a number of graphs with different sizes, as shown in Figure 6.14.

Figure 6.14. Graphs of sizes one to three generated using a single edge label p and vertex label a.

Because of the above reasons, it is challenging to enumerate all possible subgraphs that can be generated using a given set of vertex and edge labels. Figure 6.15 shows some examples of 1-graphs, 2-graphs, and 3-graphs that can be generated using vertex labels {a, b} and edge labels {p, q}. It can be seen that even using two vertex and edge labels, enumerating all possible graphs becomes difficult even for size two. Hence, it is highly impractical to use a brute-force method for frequent subgraph mining that enumerates all possible subgraphs and counts their respective supports.

Figure 6.15. Examples of graphs generated using two edge labels, p and q, and two vertex labels, a and b, for sizes varying from one to three.

However, note that the Apriori principle still holds for subgraphs because a k-graph is frequent only if all of its (k−1)-subgraphs are frequent. Hence, despite the computational challenges in enumerating all possible candidate subgraphs, we can use the Apriori principle to generate candidate k-subgraphs using frequent (k−1)-subgraphs. Algorithm 6.2 presents a generic Apriori-like approach for frequent subgraph mining. In the following, we briefly describe the three main steps of the algorithm: candidate generation, candidate pruning, and support counting.

Algorithm 6.2 Apriori-like algorithm for frequent subgraph mining.
1. F1 ← Find all frequent 1-subgraphs in G.
2. F2 ← Find all frequent 2-subgraphs in G.
3. k = 2.
4. repeat
5.   k = k + 1.
6.   Ck = candidate-gen(Fk−1). {Generate candidate k-subgraphs.}
7.   Ck = candidate-prune(Ck, Fk−1). {Perform candidate pruning.}
8.   for each graph g ∈ G do
9.     Ct = subgraph(Ck, g). {Identify all candidates contained in g.}
10.    for each candidate k-subgraph c ∈ Ct do
11.      σ(c) = σ(c) + 1. {Increment the support count.}
12.    end for
13.  end for
14.  Fk = {c | c ∈ Ck ∧ σ(c)/N ≥ minsup}. {Extract the frequent k-subgraphs.}
15. until Fk = ∅
16. Answer = ∪ Fk.

6.5.3 Candidate Generation

A pair of frequent (k−1)-subgraphs are merged to form a candidate k-subgraph if they share a common (k−2)-subgraph, known as their core. Given a common core, the subgraph merging procedure can be described as follows:

Subgraph Merging Procedure

Let Gi^(k−1) and Gj^(k−1) be two frequent (k−1)-subgraphs. Let Gi^(k−1) consist of a core Gi^(k−2) and an extra edge (u, u′), where u is part of the core. This is depicted in Figure 6.16(a), where the core is represented by a square and the extra edge is represented by a line between u and u′. Similarly, let Gj^(k−1) consist of the core Gj^(k−2) and the extra edge (v, v′), as shown in Figure 6.16(b).

Figure 6.16. A compact representation of a pair of frequent (k−1)-subgraphs considered for merging.

Using these cores, the two graphs are merged only if there exists an automorphism between the two cores, (fv, fe): Gi^(k−2) → Gj^(k−2). The resulting candidates are obtained by adding an edge to Gi^(k−1) as follows:

1. If fv(u) = v, i.e., u is mapped to v in the automorphism between the cores, then generate a candidate by adding (v, u′) to Gj^(k−1), as shown in Figure 6.17(a).

2. If fv(u) = w ≠ v, i.e., u is not mapped to v but to a different vertex w, then generate a candidate by adding (w, u′) to Gj^(k−1). Additionally, if the labels of u′ and v′ are identical, then generate another candidate by adding (w, v′) to Gi^(k−1), as shown in Figure 6.17(b).

Figure 6.17. Illustration of candidate merging procedures.

Figure 6.18(a) shows the candidate subgraphs generated by merging G1 and G2. The shaded vertices and thicker lines represent the core vertices and edges, respectively, of the two graphs, while the dotted lines represent the mapping between the two cores. Note that this example illustrates condition 1 of the subgraph merging procedure, since the endpoints of the extra edges in both the graphs are mapped to each other. This results in a single candidate subgraph, G3. On the other hand, Figure 6.18(b) shows an example of condition 2 of the subgraph merging procedure, where the endpoints of the extra edges do not map to each other and the labels of the new endpoints are identical. Merging the two graphs G4 and G5 thus results in two subgraphs, shown in the figure as G6 and G7.

Figure 6.18. Two examples of candidate k-subgraph generation using a pair of (k−1)-subgraphs.

The approach presented above of merging two frequent (k−1)-subgraphs is similar to the Fk−1 × Fk−1 candidate generation strategy introduced for itemsets in Chapter 5, and is guaranteed to exhaustively generate all frequent k-subgraphs as viable candidates (see Exercise 18). However, there are several notable differences in the candidate generation procedures of itemsets and subgraphs.

1. Merging with Self: Unlike itemsets, a frequent (k−1)-subgraph can be merged with itself to create a candidate k-subgraph. This is especially important when a k-graph contains repeating units of edges contained in a (k−1)-subgraph. As an example, the 3-graphs shown in Figure 6.14 can only be generated from the 2-graphs shown in Figure 6.14 if self-merging is allowed.

2. Multiplicity of Candidates: As described in the subgraph merging procedure, a pair of frequent (k−1)-subgraphs sharing a common core can generate multiple candidates. As an example, if the labels at the endpoints of the extra edges are identical, i.e., l(u′) = l(v′), we will generate two candidates as shown in Figure 6.18(b). On the other hand, merging a pair of frequent itemsets or subsequences generates a unique candidate itemset or subsequence.

3. Multiplicity of Cores: Two frequent (k−1)-subgraphs can share more than one core of size k−2 that is common in both the graphs. Figure 6.19 shows an example of a pair of graphs that share two common cores. Since every choice of a common core can result in a different way of merging the two graphs, this can potentially contribute to the multiplicity of candidates generated by merging the same pair of subgraphs.

Figure 6.19. Multiplicity of cores for the same pair of (k−1)-subgraphs.

4. Multiplicity of Automorphisms: The common cores of the two graphs can be mapped to each other using multiple choices of mapping functions, each resulting in a different automorphism. To illustrate this, Figure 6.20 shows a pair of graphs that share a common core of size four, represented as a square. The first core can exist in three different forms (rotated views), each resulting in a different mapping between the two cores. Since the choice of the mapping function affects the candidate generation procedure, every automorphism of the core can potentially result in a different set of candidates, as shown in Figure 6.20.

Figure 6.20. An example showing multiple ways of mapping the cores of two (k−1)-subgraphs with one another.

5. Generation of Duplicate Candidates: In the case of itemsets, generation of duplicate candidates is avoided by the use of lexicographic ordering, such that two frequent k-itemsets are merged only if their first k−1 items, arranged in lexicographic order, are identical. Unfortunately, in the case of subgraphs, there does not exist a notion of lexicographic ordering among the vertices or edges of a graph. Hence, the same candidate k-subgraph can be generated by merging two different pairs of (k−1)-subgraphs. Figure 6.21 shows an example of a candidate 4-subgraph that can be generated in two different ways, using different pairs of frequent 3-subgraphs. Thus, it is necessary to check for duplicates and eliminate the redundant graphs during candidate pruning.

Figure 6.21. Different pairs of (k−1)-subgraphs can generate the same candidate k-subgraph, thus resulting in duplicate candidates.

Algorithm 6.3 presents the complete procedure for generating the set of all candidate k-subgraphs, Ck, using the set of frequent (k−1)-subgraphs, Fk−1. We consider merging every pair of subgraphs in Fk−1, including pairs involving the same subgraph twice (to ensure self-merging). For every pair of (k−1)-subgraphs, we consider all possible connected cores of size k−2 that can be constructed from the two graphs by removing an edge from each graph. If the two cores are isomorphic, we consider all possible mappings between the vertices and edges of the two cores. For every such mapping, we employ the subgraph merging procedure to produce candidate k-subgraphs, which are added to Ck.

Algorithm 6.3 Procedure for candidate generation: candidate-gen(Fk−1).
1. Ck = ∅.
2. for each pair, Gi^(k−1) ∈ Fk−1 and Gj^(k−1) ∈ Fk−1, i ≤ j do
3.   {Considering all pairs of frequent (k−1)-subgraphs for merging.}
4.   for each pair, ei ∈ Gi^(k−1) and ej ∈ Gj^(k−1) do
5.     {Finding all common cores between a pair of frequent (k−1)-subgraphs.}
6.     Gi^(k−2) = Gi^(k−1) − ei. {Removing an edge from Gi^(k−1).}
7.     Gj^(k−2) = Gj^(k−1) − ej. {Removing an edge from Gj^(k−1).}
8.     if Gi^(k−2) ≃ Gj^(k−2) AND Gi^(k−2) and Gj^(k−2) are connected graphs then
9.       {Gi^(k−2) and Gj^(k−2) are common cores of Gi^(k−1) and Gj^(k−1), respectively.}
10.      for each (fv, fe): Gi^(k−2) → Gj^(k−2) do
11.        {Generating candidates for every automorphism between the cores.}
12.        Ck = Ck ∪ subgraph-merge(Gi^(k−2), Gj^(k−2), fv, fe, ei, ej).
13.      end for
14.    end if
15.  end for
16. end for
17. Answer = Ck.

6.5.4 Candidate Pruning

After the candidate k-subgraphs are generated, the candidates whose (k−1)-subgraphs are infrequent need to be pruned. The pruning step can be performed by identifying all possible connected (k−1)-subgraphs that can be constructed by removing one edge from a candidate k-subgraph and then checking if they have already been identified as frequent. If any of the (k−1)-subgraphs are infrequent, the candidate k-subgraph is discarded. Also, duplicate candidates need to be detected and eliminated. This can be done by comparing the canonical labels of candidate graphs, since the canonical labels of duplicate graphs will be identical. Canonical labels can also help in checking if a (k−1)-subgraph contained in a candidate k-subgraph is frequent or not, by matching its canonical label with that of every frequent (k−1)-subgraph in Fk−1.

6.5.5 Support Counting

Support counting is also a potentially costly operation because all the candidate subgraphs contained in each graph G in the collection G must be determined. One way to speed up this operation is to maintain a list of graph IDs associated with each frequent (k−1)-subgraph. Whenever a new candidate k-subgraph is generated by merging a pair of frequent (k−1)-subgraphs, their corresponding lists of graph IDs are intersected. Finally, the subgraph isomorphism tests are performed on the graphs in the intersected list to determine whether they contain a particular candidate subgraph.
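The graph-ID list optimization can be captured in a few lines. The sketch below is illustrative only and assumes a (costly) subgraph isomorphism routine contains(G, g) is supplied by the caller; the function and parameter names are our own.

```python
def count_support(candidate, parents, id_lists, graphs, contains):
    """candidate: a k-subgraph obtained by merging the two (k-1)-subgraphs in
    `parents`; id_lists maps each frequent (k-1)-subgraph to the set of IDs of
    graphs known to contain it; graphs maps an ID to the actual graph."""
    # Only graphs containing BOTH parents can possibly contain the candidate.
    possible = set.intersection(*(id_lists[p] for p in parents))
    # The expensive isomorphism test runs only on this reduced list.
    supporting = {gid for gid in possible if contains(graphs[gid], candidate)}
    return len(supporting) / len(graphs), supporting
```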

6.6 Infrequent Patterns*

The association analysis formulation described so far is based on the premise that the presence of an item in a transaction is more important than its absence. As a consequence, patterns that are rarely found in a database are often considered to be uninteresting and are eliminated using the support measure. Such patterns are known as infrequent patterns.

Definition 6.8 (Infrequent Pattern). An infrequent pattern is an itemset or a rule whose support is less than the minsup threshold.

Although a vast majority of infrequent patterns are uninteresting, some of them might be useful to the analysts, particularly those that correspond to negative correlations in the data. For example, the sale of DVDs and VCRs together is low because any customer who buys a DVD will most likely not buy a VCR, and vice versa. Such negatively correlated patterns are useful to help identify competing items, which are items that can be substituted for one another. Examples of competing items include tea versus coffee, butter versus margarine, regular versus diet soda, and desktop versus laptop computers.

Some infrequent patterns may also suggest the occurrence of interesting rare events or exceptional situations in the data. For example, if {Fire = Yes} is frequent but {Fire = Yes, Alarm = On} is infrequent, then the latter is an interesting infrequent pattern because it may indicate faulty alarm systems. To detect such unusual situations, the expected support of a pattern must be determined, so that, if a pattern turns out to have a considerably lower support than expected, it is declared as an interesting infrequent pattern.

Mining infrequent patterns is a challenging endeavor because there is an enormous number of such patterns that can be derived from a given data set. More specifically, the key issues in mining infrequent patterns are: (1) how to identify interesting infrequent patterns, and (2) how to efficiently discover them in large data sets. To get some perspective on various types of interesting infrequent patterns, two related concepts—negative patterns and negatively correlated patterns—are introduced in Sections 6.6.1 and 6.6.2, respectively. The relationships among these patterns are elucidated in Section 6.6.3. Finally, two classes of techniques developed for mining interesting infrequent patterns are presented in Sections 6.6.5 and 6.6.6.

6.6.1 Negative Patterns

Let I = {i1, i2, …, id} be a set of items. A negative item, i̅k, denotes the absence of item ik from a given transaction. For example, the negative item for coffee has a value of 1 in a transaction that does not contain coffee.

Definition 6.9 (Negative Itemset). A negative itemset X is an itemset that has the following properties: (1) X = A ∪ B̄, where A is a set of positive items, B̄ is a set of negative items, |B̄| ≥ 1, and (2) s(X) ≥ minsup.

Definition 6.10 (Negative Association Rule). A negative association rule is an association rule that has the following properties: (1) the rule is extracted from a negative itemset, (2) the support of the rule is greater than or equal to minsup, and (3) the confidence of the rule is greater than or equal to minconf.

The negative itemsets and negative association rules are collectively known as negative patterns throughout this chapter. An example of a negative association rule is one relating tea to the absence of coffee, which may suggest that people who drink tea tend to not drink coffee.

6.6.2 Negatively Correlated Patterns

Section 5.7.1 on page 402 described how correlation analysis can be used to analyze the relationship between a pair of categorical variables. Measures such as interest factor (Equation 5.5) and the φ-coefficient (Equation 5.8) were shown to be useful for discovering itemsets that are positively correlated. This section extends the discussion to negatively correlated patterns.

Definition 6.11 (Negatively Correlated Itemset). An itemset X, which is defined as X = {x1, x2, …, xk}, is negatively correlated if

s(X) < ∏j=1..k s(xj) = s(x1) × s(x2) × … × s(xk),   (6.3)

where s(x) is the support of the item x.

Note that the support of an itemset is an estimate of the probability that a transaction contains the itemset. Hence, the right-hand side of the preceding expression, ∏j=1..k s(xj), represents an estimate of the probability that all the items in X are statistically independent. Definition 6.11 suggests that an itemset is negatively correlated if its support is below the expected support computed using the statistical independence assumption. The smaller s(X), the more negatively correlated is the pattern.
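Definition 6.11 is straightforward to test once the supports are available. The sketch below is our own illustration, with a toy transaction set chosen so that the two items never co-occur.

```python
from functools import reduce

def is_negatively_correlated(itemset, s):
    """Definition 6.11: s(X) < product of s({x}) over x in X.
    `s` is a callable returning the support of a set of items."""
    expected = reduce(lambda acc, x: acc * s({x}), itemset, 1.0)
    return s(set(itemset)) < expected

# Toy transactions: p and q never co-occur, so {p, q} is negatively correlated.
T = [{"p"}, {"p"}, {"q"}, {"q"}]
s = lambda items: sum(set(items) <= t for t in T) / len(T)
print(is_negatively_correlated({"p", "q"}, s))   # True: 0 < 0.5 * 0.5
```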

Definition 6.12 (Negatively Correlated Association Rule). An association rule X → Y is negatively correlated if

s(X ∪ Y) < s(X)s(Y),   (6.4)

where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅.

The preceding definition provides only a partial condition for negative correlation between items in X and items in Y. A full condition for negative correlation can be stated as follows:

s(X ∪ Y) < ∏i s(xi) ∏j s(yj),   (6.5)

where xi ∈ X and yj ∈ Y. Because the items in X (and in Y) are often positively correlated, it is more practical to use the partial condition to define a negatively correlated association rule instead of the full condition. For example, a rule may be negatively correlated according to Inequality 6.4 even though the items within its antecedent, and the items within its consequent, are positively correlated with one another. If Inequality 6.5 is applied instead, such a rule could be missed because it may not satisfy the full condition for negative correlation.

The condition for negative correlation can also be expressed in terms of the support for positive and negative itemsets. Let X̄ and Ȳ denote the corresponding negative itemsets for X and Y, respectively. Since

s(X ∪ Y) − s(X)s(Y) = s(X ∪ Y) − [s(X ∪ Y) + s(X ∪ Ȳ)][s(X ∪ Y) + s(X̄ ∪ Y)]
                    = s(X ∪ Y)[1 − s(X ∪ Y) − s(X ∪ Ȳ) − s(X̄ ∪ Y)] − s(X ∪ Ȳ)s(X̄ ∪ Y)
                    = s(X ∪ Y)s(X̄ ∪ Ȳ) − s(X ∪ Ȳ)s(X̄ ∪ Y),

the condition for negative correlation can be stated as follows:

s(X ∪ Y)s(X̄ ∪ Ȳ) < s(X ∪ Ȳ)s(X̄ ∪ Y).   (6.6)

The negatively correlated itemsets and association rules are known as negatively correlated patterns throughout this chapter.

6.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns

Infrequent patterns, negative patterns, and negatively correlated patterns are three closely related concepts. Infrequent patterns and negatively correlated patterns refer only to itemsets or rules that contain positive items, while negative patterns refer to itemsets or rules that contain both positive and negative items. Nevertheless, there are certain commonalities among these concepts, as illustrated in Figure 6.22.


Figure 6.22. Comparisons among infrequent patterns, negative patterns, and negatively correlated patterns.

First, note that many infrequent patterns have corresponding negative patterns. To understand why this is the case, consider the contingency table shown in Table 6.9. If X ∪ Y is infrequent, then it is likely to have a corresponding negative itemset unless minsup is too high. For example, assuming that minsup ≤ 0.25, if X ∪ Y is infrequent, then the support for at least one of the following itemsets, X ∪ Ȳ, X̄ ∪ Y, or X̄ ∪ Ȳ, must be higher than minsup, since the sum of the supports in a contingency table is 1.

Table 6.9. A two-way contingency table for the association rule X → Y.

          Y            Ȳ
X         s(X ∪ Y)     s(X ∪ Ȳ)     s(X)
X̄         s(X̄ ∪ Y)     s(X̄ ∪ Ȳ)     s(X̄)
          s(Y)         s(Ȳ)         1

Second, note that many negatively correlated patterns also have corresponding negative patterns. Consider the contingency table shown in Table 6.9 and the condition for negative correlation stated in Inequality 6.6. If X and Y have strong negative correlation, then

s(X ∪ Ȳ) × s(X̄ ∪ Y) ≫ s(X ∪ Y) × s(X̄ ∪ Ȳ).

Therefore, either X ∪ Ȳ or X̄ ∪ Y, or both, must have relatively high support when X and Y are negatively correlated. These itemsets correspond to the negative patterns. Finally, because the lower the support of X ∪ Y, the more negatively correlated is the pattern, infrequent patterns tend to be stronger negatively correlated patterns than frequent ones.

6.6.4 Techniques for Mining Interesting Infrequent Patterns

In principle, infrequent itemsets are given by all itemsets that are not extracted by standard frequent itemset generation algorithms such as Apriori and FP-growth. These itemsets correspond to those located below the frequent itemset border shown in Figure 6.23.

Figure 6.23. Frequent and infrequent itemsets.

Since the number of infrequent patterns can be exponentially large, especially for sparse, high-dimensional data, techniques developed for mining infrequent patterns focus on finding only interesting infrequent patterns. An example of such patterns includes the negatively correlated patterns discussed in Section 6.6.2. These patterns are obtained by eliminating all infrequent itemsets that fail the negative correlation condition provided in Inequality 6.3. This approach can be computationally intensive because the supports for all infrequent itemsets must be computed in order to determine whether they are negatively correlated. Unlike the support measure used for mining frequent itemsets, correlation-based measures used for mining negatively correlated itemsets do not possess an anti-monotone property that can be exploited for pruning the exponential search space. Although an efficient solution remains elusive, several innovative methods have been developed, as mentioned in the Bibliographic Notes provided at the end of this chapter.

The remainder of this chapter presents two classes of techniques for mining interesting infrequent patterns. Section 6.6.5 describes methods for mining negative patterns in data, while Section 6.6.6 describes methods for finding interesting infrequent patterns based on support expectation.

6.6.5 Techniques Based on Mining Negative Patterns

The first class of techniques developed for mining infrequent patterns treats every item as a symmetric binary variable. Using the approach described in Section 6.1, the transaction data can be binarized by augmenting it with negative items. Figure 6.24 shows an example of transforming the original data into transactions having both positive and negative items. By applying existing frequent itemset generation algorithms such as Apriori on the augmented transactions, all the negative itemsets can be derived.

Figure 6.24. Augmenting a data set with negative items.

Such an approach is feasible only if a few variables are treated as symmetric binary (i.e., we look for negative patterns involving the negation of only a small number of items). If every item must be treated as symmetric binary, the problem becomes computationally intractable due to the following reasons.

1. The number of items doubles when every item is augmented with its corresponding negative item. Instead of exploring an itemset lattice of size 2^d, where d is the number of items in the original data set, the lattice becomes considerably larger, as shown in Exercise 22 on page 522.

2. Support-based pruning is no longer effective when negative items are augmented. For each variable x, either x or x̄ has support greater than or equal to 50%. Hence, even if the support threshold is as high as 50%, half of the items will remain frequent. For lower thresholds, many more items and possibly itemsets containing them will be frequent. The support-based pruning strategy employed by Apriori is effective only when the support for most itemsets is low; otherwise, the number of frequent itemsets grows exponentially.

3. The width of each transaction increases when negative items are augmented. Suppose there are d items available in the original data set. For sparse data sets such as market basket transactions, the width of each transaction tends to be much smaller than d. As a result, the maximum size of a frequent itemset, which is bounded by the maximum transaction width, wmax, tends to be relatively small. When negative items are included, the width of the transactions increases to d because an item is either present in the transaction or absent from the transaction, but not both. Since the maximum transaction width has grown from wmax to d, this will increase the number of frequent itemsets exponentially. As a result, many existing algorithms tend to break down when they are applied to the extended data set.

The previous brute-force approach is computationally expensive because it forces us to determine the support for a large number of positive and negative patterns. Instead of augmenting the data set with negative items, another approach is to determine the support of the negative itemsets based on the support of their corresponding positive items. For example, the support for {p, q̄, r̄} can be computed in the following way:

s({p, q̄, r̄}) = s({p}) − s({p, q}) − s({p, r}) + s({p, q, r}).

More generally, the support for any itemset X ∪ Ȳ can be obtained as follows:

s(X ∪ Ȳ) = s(X) + Σi=1..n ΣZ⊂Y,|Z|=i {(−1)^i × s(X ∪ Z)}.   (6.7)

To apply Equation 6.7, s(X ∪ Z) must be determined for every Z that is a subset of Y. The support for any combination of X and Z that exceeds the minsup threshold can be found using the Apriori algorithm. For all other combinations, the supports must be determined explicitly, e.g., by scanning the entire set of transactions. Another possible approach is to either ignore the support for any infrequent itemset X ∪ Z or to approximate it with the minsup threshold.
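The inclusion–exclusion computation of Equation 6.7 is easy to express directly. The sketch below is our own illustration with a tiny hypothetical transaction set; the helper names are assumptions, not code from the text.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing every (positive) item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def support_with_negations(x, y_neg, transactions):
    """s(X ∪ Ȳ) via Equation 6.7: sum over Z ⊆ Y of (-1)^|Z| s(X ∪ Z);
    the Z = ∅ term gives s(X)."""
    total = 0.0
    for i in range(len(y_neg) + 1):
        for z in combinations(sorted(y_neg), i):
            total += (-1) ** i * support(set(x) | set(z), transactions)
    return total

# Example: s({p, q̄, r̄}) = s({p}) - s({p,q}) - s({p,r}) + s({p,q,r}).
transactions = [{"p"}, {"p", "q"}, {"p", "r"}, {"p", "q", "r"}, {"q"}]
direct = sum(("p" in t) and ("q" not in t) and ("r" not in t)
             for t in transactions) / len(transactions)
print(support_with_negations({"p"}, {"q", "r"}, transactions), direct)  # both 0.2
```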

Several optimization strategies are available to further improve the performance of the mining algorithms. First, the number of variables considered as symmetric binary can be restricted. More specifically, a negative item ȳ is considered interesting only if y is a frequent item. The rationale for this strategy is that rare items tend to produce a large number of infrequent patterns, many of which are uninteresting. By restricting the set Y given in Equation 6.7 to variables whose positive items are frequent, the number of candidate negative itemsets considered by the mining algorithm can be substantially reduced. Another strategy is to restrict the type of negative patterns. For example, the algorithm may consider only a negative pattern X ∪ Ȳ if it contains at least one positive item (i.e., |X| ≥ 1). The rationale for this strategy is that if the data set contains very few positive items with support greater than 50%, then most of the negative patterns of the form X̄ ∪ Ȳ will become frequent, thus degrading the performance of the mining algorithm.

6.6.6 Techniques Based on Support Expectation

Another class of techniques considers an infrequent pattern to be interesting only if its actual support is considerably smaller than its expected support. For negatively correlated patterns, the expected support is computed based on the statistical independence assumption. This section describes two alternative approaches for determining the expected support of a pattern using (1) a concept hierarchy and (2) a neighborhood-based approach known as indirect association.

Support Expectation Based on Concept Hierarchy Objective measures alone may not be sufficient to eliminate uninteresting infrequent patterns. For example, suppose two frequent items come from completely unrelated product categories. Even though an itemset containing both items is infrequent and perhaps negatively correlated, it is not interesting because their lack of support seems obvious to domain experts. Therefore, a subjective approach for determining expected support is needed to avoid generating such infrequent patterns.

In the preceding example, the two items belong to two completely different product categories, which is why it is not surprising to find that their support is low. This example also illustrates the advantage of using domain knowledge to prune uninteresting patterns. For market basket data, the domain knowledge can be inferred from a concept hierarchy such as the one shown in Figure 6.25. The basic assumption of this approach is that items from the same product family are expected to have similar types of interaction with other items. For example, since two items from the same product family are expected to behave similarly, the association between one of them and some third item should be roughly similar to the association between the other and that same third item. If the actual support for any one of these pairs is less than their expected support, then the infrequent pattern is interesting.

Figure 6.25. Example of a concept hierarchy.

To illustrate how to compute the expected support, consider the diagram shown in Figure 6.26. Suppose the itemset {C, G} is frequent. Let s(·) denote the actual support of a pattern and ε(·) denote its expected support. The expected support for any children or siblings of C and G can be computed using the formulas shown below:

ε(s(E, J)) = s(C, G) × s(E)/s(C) × s(J)/s(G)   (6.8)
ε(s(C, J)) = s(C, G) × s(J)/s(G)               (6.9)
ε(s(C, H)) = s(C, G) × s(H)/s(G)               (6.10)

Figure 6.26. Mining interesting negative patterns using a concept hierarchy.

For example, if E and J are frequent, then the expected support between E and J can be computed using Equation 6.8 because these items are children of C and G, respectively. If the actual support for E and J is considerably lower than their expected value, then E and J form an interesting infrequent pattern.
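Equations 6.8 through 6.10 share one idea: rescale s(C, G) by s(child)/s(parent) for each side of the pair that is replaced by a child. The sketch below is our own illustration of that idea, using hypothetical support values rather than numbers from the text.

```python
def expected_support(s, parent_pair, item_pair, parent_of):
    """Expected support of an item pair per Equations 6.8-6.10.
    `s` maps a frozenset (or single item) to its support; `parent_of` maps an
    item to its parent category in the concept hierarchy (e.g., E -> C)."""
    C, G = parent_pair
    x, y = item_pair
    expected = s[frozenset({C, G})]
    if parent_of.get(x) == C:            # x is a child of C: rescale by s(x)/s(C)
        expected *= s[x] / s[C]
    if parent_of.get(y) == G:            # y is a child of G: rescale by s(y)/s(G)
        expected *= s[y] / s[G]
    return expected

# Hypothetical supports: epsilon(s(E, J)) = s(C, G) * s(E)/s(C) * s(J)/s(G).
s = {frozenset({"C", "G"}): 0.2, "C": 0.5, "G": 0.4, "E": 0.3, "J": 0.1}
print(expected_support(s, ("C", "G"), ("E", "J"), {"E": "C", "J": "G"}))  # 0.03
```

Passing a pair in which only one item is a child (e.g., ("C", "J")) reproduces Equation 6.9, since only one rescaling factor applies.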

Support Expectation Based on Indirect Association Consider a pair of items, (a, b), that are rarely bought together by customers. If a and b are unrelated items, then their support is expected to be low. On the other hand, if a and b are related items, then their support is expected to be high. The expected support was previously computed using a concept hierarchy. This section presents an approach for determining the expected support between a pair of items by looking at other items commonly purchased together with these two items.

For example, suppose customers who buy a sleeping bag also tend to buy other camping equipment, whereas those who buy a desktop computer also tend to buy other computer accessories such as an optical mouse or a printer. Assuming there is no other item frequently bought together with both a sleeping bag and a desktop computer, the support for these unrelated items is expected to be low. On the other hand, suppose a and b are often bought together with the same set of items. Even without using a concept hierarchy, both items are expected to be somewhat related and their support should be high. Because their actual support is low, a and b form an interesting infrequent pattern. Such patterns are known as indirect association patterns.

A high-level illustration of indirect association is shown in Figure 6.27. Items a and b are the indirectly associated pair, while Y, which is known as the mediator set, contains items that are frequently bought together with both a and b. A formal definition of indirect association is presented next.

Figure 6.27. An indirect association between a pair of items.

Definition 6.13 (Indirect Association). A pair of items a, b is indirectly associated via a mediator set Y if the following conditions hold:

1. s({a, b}) < ts (Itempair support condition).
2. ∃Y ≠ ∅ such that:
   a. s({a} ∪ Y) ≥ tf and s({b} ∪ Y) ≥ tf (Mediator support condition).
   b. d({a}, Y) ≥ td, d({b}, Y) ≥ td, where d(X, Z) is an objective measure of the association between X and Z (Mediator dependence condition).

Note that the mediator support and dependence conditions are used to ensure that items in Y form a close neighborhood to both a and b. Some of the dependence measures that can be used include interest, cosine or IS, Jaccard, and other measures previously described in Section 5.7.1 on page 402.

Indirect association has many potential applications. In the market basket domain, a and b may refer to competing items, such as tea and coffee. In text mining, indirect association can be used to identify synonyms, antonyms, or words that are used in different contexts. For example, given a collection of documents, the word data may be indirectly associated with gold via the mediator mining. This pattern suggests that the word mining can be used in two different contexts: data mining versus gold mining.

Indirect associations can be generated in the following way. First, the set of frequent itemsets is generated using standard algorithms such as Apriori or FP-growth. Each pair of frequent k-itemsets is then merged to obtain a candidate indirect association (a, b, Y), where a and b are a pair of items and Y is their common mediator. For example, if {p, q, r} and {p, q, s} are frequent 3-itemsets, then the candidate indirect association (r, s, {p, q}) is obtained by merging the pair of frequent itemsets. Once the candidates have been generated, it is necessary to verify that they satisfy the item pair support and mediator dependence conditions provided in Definition 6.13. However, the mediator support condition does not have to be verified because the candidate indirect association is obtained by merging a pair of frequent itemsets. A summary of the algorithm is shown in Algorithm 6.4.

Algorithm 6.4 Algorithm for mining indirect associations.
1. Generate Fk, the set of frequent itemsets.
2. for k = 2 to kmax do
3.   Ck = {(a, b, Y) | {a} ∪ Y ∈ Fk, {b} ∪ Y ∈ Fk, a ≠ b}
4.   for each candidate (a, b, Y) ∈ Ck do
5.     if s({a, b}) < ts ∧ d({a}, Y) ≥ td ∧ d({b}, Y) ≥ td then
6.       Ik = Ik ∪ {(a, b, Y)}
7.     end if
8.   end for
9.  end for
10. Result = ∪ Ik.
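The candidate-generation step of Algorithm 6.4 amounts to pairing frequent k-itemsets that differ in exactly one item. The sketch below is our own illustration, assuming the caller supplies a support function and a dependence measure d(X, Z) (e.g., interest or IS); it is not the authors' reference implementation.

```python
from itertools import combinations

def indirect_associations(frequent_itemsets, support, dependence, ts, td):
    """frequent_itemsets: iterable of frozensets of the same size k;
    support(itemset) returns the itemset's support; dependence(X, Z) is d(X, Z)."""
    results = set()
    for f1, f2 in combinations(frequent_itemsets, 2):
        mediator = f1 & f2
        extra = f1 ^ f2                      # the two items not shared
        if len(extra) != 2 or not mediator:
            continue                         # not of the form {a} U Y and {b} U Y
        a, b = sorted(extra)
        if (support(frozenset({a, b})) < ts                   # itempair support
                and dependence(frozenset({a}), mediator) >= td  # mediator dependence
                and dependence(frozenset({b}), mediator) >= td):
            results.add((a, b, mediator))
    return results
```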

6.7 Bibliographic Notes

The problem of mining association rules from categorical and continuous data was introduced by Srikant and Agrawal in [495]. Their strategy was to binarize the categorical attributes and to apply equal-frequency discretization to the continuous attributes. A partial completeness measure was also proposed to determine the amount of information loss as a result of discretization. This measure was then used to determine the number of discrete intervals needed to ensure that the amount of information loss can be kept at a certain desired level. Following this work, numerous other formulations have been proposed for mining quantitative association rules. Instead of discretizing the quantitative attributes, a statistical-based approach was developed by Aumann and Lindell [465], where summary statistics such as mean and standard deviation are computed for the quantitative attributes of the rules. This formulation was later extended by other authors including Webb [501] and Zhang et al. [506]. The min-Apriori algorithm was developed by Han et al. [474] for finding association rules in continuous data without discretization. Following the min-Apriori, a range of techniques for capturing different types of associations among continuous attributes have been explored. For example, the RAnge support Patterns (RAP) developed by Pandey et al. [487] finds groups of attributes that show coherent values across multiple rows of the data matrix. The RAP framework was extended to deal with noisy data by Gupta et al. [473]. Since the rules can be designed to satisfy multiple objectives, evolutionary algorithms for mining quantitative association rules [484, 485] have also been developed. Other techniques include those proposed by Fukuda et al. [471], Lent et al. [480], Wang et al. [500], Ruckert et al. [490], and Miller and Yang [486].

ThemethoddescribedinSection6.3 forhandlingconcepthierarchyusingextendedtransactionswasdevelopedbySrikantandAgrawal[494].AnalternativealgorithmwasproposedbyHanandFu[475],wherefrequentitemsetsaregeneratedonelevelatatime.Morespecifically,theiralgorithminitiallygeneratesallthefrequent1-itemsetsatthetopleveloftheconcepthierarchy.Thesetoffrequent1-itemsetsisdenotedasL(1,1).Usingthefrequent1-itemsetsinL(1,1),thealgorithmproceedstogenerateallfrequent2-itemsetsatlevel1,L(1,2).Thisprocedureisrepeateduntilallthefrequentitemsetsinvolvingitemsfromthehighestlevelofthehierarchy,L(1,k)( ),areextracted.Thealgorithmthencontinuestoextractfrequentitemsetsatthenextlevelofthehierarchy,L(2,1),basedonthefrequentitemsetsinL(1,1).Theprocedureisrepeateduntilitterminatesatthelowestleveloftheconcepthierarchyrequestedbytheuser.

ThesequentialpatternformulationandalgorithmdescribedinSection6.4wasproposedbyAgrawalandSrikantin[463,496].Similarly,Mannilaetal.[483]introducedtheconceptoffrequentepisode,whichisusefulforminingsequentialpatternsfromalongstreamofevents.AnotherformulationofsequentialpatternminingbasedonregularexpressionswasproposedbyGarofalakisetal.in[472].Joshietal.haveattemptedtoreconcilethedifferencesbetweenvarioussequentialpatternformulations[477].TheresultwasauniversalformulationofsequentialpatternwiththedifferentcountingschemesdescribedinSection6.4.4 .AlternativealgorithmsforminingsequentialpatternswerealsoproposedbyPeietal.[489],Ayresetal.[466],Chengetal.[468],andSenoetal.[492].Areviewonsequentialpatternminingalgorithmscanbefoundin[482]and[493].Extensionsoftheformulationtomaximal[470,481]andclosed[499,504]sequentialpatternmininghavealsobeendevelopedinrecentyears.

The frequent subgraph mining problem was initially introduced by Inokuchi et al. in [476]. They used a vertex-growing approach for generating frequent induced subgraphs from a graph data set. The edge-growing strategy was developed by Kuramochi and Karypis in [478], where they also presented an Apriori-like algorithm called FSG that addresses issues such as multiplicity of candidates, canonical labeling, and vertex invariant schemes. Another frequent subgraph mining algorithm known as gSpan was developed by Yan and Han in [503]. The authors proposed using a minimum DFS code for encoding the various subgraphs. Other variants of the frequent subgraph mining problems were proposed by Zaki in [505], Parthasarathy and Coatney in [488], and Kuramochi and Karypis in [479]. A recent review on graph mining is given by Cheng et al. in [469].

The problem of mining infrequent patterns has been investigated by many authors. Savasere et al. [491] examined the problem of mining negative association rules using a concept hierarchy. Tan et al. [497] proposed the idea of mining indirect associations for sequential and non-sequential data. Efficient algorithms for mining negative patterns have also been proposed by Boulicaut et al. [467], Teng et al. [498], Wu et al. [502], and Antonie and Zaïane [464].

Bibliography

[463]R.AgrawalandR.Srikant.MiningSequentialPatterns.InProc.ofIntl.Conf.onDataEngineering,pages3–14,Taipei,Taiwan,1995.

[464]M.-L.AntonieandO.R.Zaïane.MiningPositiveandNegativeAssociationRules:AnApproachforConfinedRules.InProc.ofthe8thEuropeanConf.ofPrinciplesandPracticeofKnowledgeDiscoveryinDatabases,pages27–38,Pisa,Italy,September2004.

[465]Y.AumannandY.Lindell.AStatisticalTheoryforQuantitativeAssociationRules.InKDD99,pages261–270,SanDiego,CA,August1999.

[466]J.Ayres,J.Flannick,J.Gehrke,andT.Yiu.SequentialPatternminingusingabitmaprepresentation.InProc.ofthe8thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages429–435,Edmonton,Canada,July2002.

[467]J.-F.Boulicaut,A.Bykowski,andB.Jeudy.TowardstheTractableDiscoveryofAssociationRuleswithNegations.InProc.ofthe4thIntl.ConfonFlexibleQueryAnsweringSystemsFQAS’00,pages425–434,Warsaw,Poland,October2000.

[468]H.Cheng,X.Yan,andJ.Han.IncSpan:incrementalminingofsequentialpatternsinlargedatabase.InProc.ofthe10thIntl.Conf.on

KnowledgeDiscoveryandDataMining,pages527–532,Seattle,WA,August2004.

[469]H.Cheng,X.Yan,andJ.Han.MiningGraphPatterns.InC.AggarwalandJ.Han,editors,FrequentPatternMining,pages307–338.Springer,2014.

[470]P.Fournier-Viger,C.-W.Wu,A.Gomariz,andV.S.Tseng.VMSP:Efficientverticalminingofmaximalsequentialpatterns.InProceedingsoftheCanadianConferenceonArtificialIntelligence,pages83–94,2014.

[471]T.Fukuda,Y.Morimoto,S.Morishita,andT.Tokuyama.MiningOptimizedAssociationRulesforNumericAttributes.InProc.ofthe15thSymp.onPrinciplesofDatabaseSystems,pages182–191,Montreal,Canada,June1996.

[472]M.N.Garofalakis,R.Rastogi,andK.Shim.SPIRIT:SequentialPatternMiningwithRegularExpressionConstraints.InProc.ofthe25thVLDBConf.,pages223–234,Edinburgh,Scotland,1999.

[473]R.Gupta,N.Rao,andV.Kumar.Discoveryoferror-tolerantbiclustersfromnoisygeneexpressiondata.BMCbioinformatics,12(12):1,2011.

[474]E.-H.Han,G.Karypis,andV.Kumar.Min-Apriori:AnAlgorithmforFindingAssociationRulesinDatawithContinuousAttributes.http://www.cs.umn.edu/˜han,1997.

[475]J.HanandY.Fu.MiningMultiple-LevelAssociationRulesinLargeDatabases.IEEETrans.onKnowledgeandDataEngineering,11(5):798–804,1999.

[476]A.Inokuchi,T.Washio,andH.Motoda.AnApriori-basedAlgorithmforMiningFrequentSubstructuresfromGraphData.InProc.ofthe4thEuropeanConf.ofPrinciplesandPracticeofKnowledgeDiscoveryinDatabases,pages13–23,Lyon,France,2000.

[477]M.V.Joshi,G.Karypis,andV.Kumar.AUniversalFormulationofSequentialPatterns.InProc.oftheKDD’2001workshoponTemporalDataMining,SanFrancisco,CA,August2001.

[478]M.KuramochiandG.Karypis.FrequentSubgraphDiscovery.InProc.ofthe2001IEEEIntl.Conf.onDataMining,pages313–320,SanJose,CA,November2001.

[479]M.KuramochiandG.Karypis.DiscoveringFrequentGeometricSubgraphs.InProc.ofthe2002IEEEIntl.Conf.onDataMining,pages258–265,MaebashiCity,Japan,December2002.

[480]B.Lent,A.Swami,andJ.Widom.ClusteringAssociationRules.InProc.ofthe13thIntl.Conf.onDataEngineering,pages220–231,Birmingham,U.K,April1997.

[481]C.LuoandS.M.Chung.Efficientminingofmaximalsequentialpatternsusingmultiplesamples.InProceedingsoftheSIAMInternationalConferenceonDataMining,pages415–426,2005.

[482]N.R.MabroukehandC.Ezeife.Ataxonomyofsequentialpatternminingalgorithms.ACMComputingSurvey,43(1),2010.

[483]H.Mannila,H.Toivonen,andA.I.Verkamo.DiscoveryofFrequentEpisodesinEventSequences.DataMiningandKnowledgeDiscovery,1(3):259–289,November1997.

[484]D.Martin,A.Rosete,J.Alcalá-Fdez,andF.Herrera.Anewmultiobjectiveevolutionaryalgorithmforminingareducedsetofinterestingpositiveandnegativequantitativeassociationrules.IEEETransactionsonEvolutionaryComputation,18(1):54–69,2014.

[485]J.Mata,J.L.Alvarez,andJ.C.Riquelme.MiningNumericAssociationRuleswithGeneticAlgorithms.InProceedingsoftheInternationalConferenceonArtificialNeuralNetsandGeneticAlgorithms,pages264–267,Prague,CzechRepublic,2001.Springer.

[486]R.J.MillerandY.Yang.AssociationRulesoverIntervalData.InProc.of1997ACM-SIGMODIntl.Conf.onManagementofData,pages452–461,Tucson,AZ,May1997.

[487]G.Pandey,G.Atluri,M.Steinbach,C.L.Myers,andV.Kumar.Anassociationanalysisapproachtobiclustering.InProceedingsofthe15thACMSIGKDDinternationalconferenceonKnowledgediscoveryanddatamining,pages677–686.ACM,2009.

[488]S.ParthasarathyandM.Coatney.EfficientDiscoveryofCommonSubstructuresinMacromolecules.InProc.ofthe2002IEEEIntl.Conf.onDataMining,pages362–369,MaebashiCity,Japan,December2002.

[489]J.Pei,J.Han,B.Mortazavi-Asl,Q.Chen,U.Dayal,andM.Hsu.PrefixSpan:MiningSequentialPatternsefficientlybyprefix-projectedpatterngrowth.InProcofthe17thIntl.Conf.onDataEngineering,Heidelberg,Germany,April2001.

[490]U.Ruckert,L.Richter,andS.Kramer.Quantitativeassociationrulesbasedonhalf-spaces:Anoptimizationapproach.InProceedingsoftheFourthIEEEInternationalConferenceonDataMining,pages507–510,2004.

[491]A.Savasere,E.Omiecinski,andS.Navathe.MiningforStrongNegativeAssociationsinaLargeDatabaseofCustomerTransactions.InProc.ofthe14thIntl.Conf.onDataEngineering,pages494–502,Orlando,Florida,February1998.

[492]M.SenoandG.Karypis.SLPMiner:AnAlgorithmforFindingFrequentSequentialPatternsUsingLength-DecreasingSupportConstraint.InProc.ofthe2002IEEEIntl.Conf.onDataMining,pages418–425,MaebashiCity,Japan,December2002.

[493]W.Shen,J.Wang,andJ.Han.SequentialPatternMining.InC.AggarwalandJ.Han,editors,FrequentPatternMining,pages261–282.Springer,2014.

[494]R.SrikantandR.Agrawal.MiningGeneralizedAssociationRules.InProc.ofthe21stVLDBConf.,pages407–419,Zurich,Switzerland,1995.

[495]R.SrikantandR.Agrawal.MiningQuantitativeAssociationRulesinLargeRelationalTables.InProc.of1996ACM-SIGMODIntl.Conf.onManagementofData,pages1–12,Montreal,Canada,1996.

[496]R.SrikantandR.Agrawal.MiningSequentialPatterns:GeneralizationsandPerformanceImprovements.InProc.ofthe5thIntlConf.onExtendingDatabaseTechnology(EDBT’96),pages18–32,Avignon,France,1996.

[497]P.N.Tan,V.Kumar,andJ.Srivastava.IndirectAssociation:MiningHigherOrderDependenciesinData.InProc.ofthe4thEuropeanConf.ofPrinciplesandPracticeofKnowledgeDiscoveryinDatabases,pages632–637,Lyon,France,2000.

[498]W.G.Teng,M.J.Hsieh,andM.-S.Chen.OntheMiningofSubstitutionRulesforStatisticallyDependentItems.InProc.ofthe2002IEEEIntl.Conf.onDataMining,pages442–449,MaebashiCity,Japan,December2002.

[499]P.Tzvetkov,X.Yan,andJ.Han.TSP:Miningtop-kclosedsequentialpatterns.KnowledgeandInformationSystems,7(4):438–457,2005.

[500]K.Wang,S.H.Tay,andB.Liu.Interestingness-BasedIntervalMergerforNumericAssociationRules.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages121–128,NewYork,NY,August1998.

[501]G.I.Webb.Discoveringassociationswithnumericvariables.InProc.ofthe7thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages383–388,SanFrancisco,CA,August2001.

[502]X.Wu,C.Zhang,andS.Zhang.MiningBothPositiveandNegativeAssociationRules.ACMTrans.onInformationSystems,22(3):381–405,2004.

[503]X.YanandJ.Han.gSpan:Graph-basedSubstructurePatternMining.InProc.ofthe2002IEEEIntl.Conf.onDataMining,pages721–724,MaebashiCity,Japan,December2002.

[504]X.Yan,J.Han,andR.Afshar.CloSpan:Mining:Closedsequentialpatternsinlargedatasets.InProceedingsoftheSIAMInternationalConferenceonDataMining,pages166–177,2003.

[505]M.J.Zaki.Efficientlyminingfrequenttreesinaforest.InProc.ofthe8thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages71–80,Edmonton,Canada,July2002.

[506]H.Zhang,B.Padmanabhan,andA.Tuzhilin.OntheDiscoveryofSignificantStatisticalQuantitativeRules.InProc.ofthe10thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages374–383,Seattle,WA,August2004.

6.8 Exercises

1. Consider the traffic accident data set shown in Table 6.10.

Table6.10.Trafficaccidentdataset.

WeatherCondition Driver’sCondition TrafficViolation SeatBelt CrashSeverity

Good Alcohol-impaired Exceedspeedlimit No Major

Bad Sober None Yes Minor

Good Sober Disobeystopsign Yes Minor

Good Sober Exceedspeedlimit Yes Major

Bad Sober Disobeytrafficsignal No Major

Good Alcohol-impaired Disobeystopsign Yes Minor

Bad Alcohol-impaired None Yes Major

Good Sober Disobeytrafficsignal Yes Major

Good Alcohol-impaired None No Major

Bad Sober Disobeytrafficsignal No Major

Good Alcohol-impaired Exceedspeedlimit Yes Major

Bad Sober Disobeystopsign Yes Minor

a. Showabinarizedversionofthedataset.

b. Whatisthemaximumwidthofeachtransactioninthebinarizeddata?

c. Assumingthatthesupportthresholdis30%,howmanycandidateandfrequentitemsetswillbegenerated?

d. Create a data set that contains only the following asymmetric binary attributes: (…); only None has a value of 0. The rest of the attribute values are assigned to 1. Assuming that the support threshold is 30%, how many candidate and frequent itemsets will be generated?

e. Comparethenumberofcandidateandfrequentitemsetsgeneratedinparts(c)and(d).

2.

a. ConsiderthedatasetshowninTable6.11 .Supposeweapplythefollowingdiscretizationstrategiestothecontinuousattributesofthedataset.

Table6.11.DatasetforExercise2 .

TID Temperature Pressure Alarm1 Alarm2 Alarm3

1 95 1105 0 0 1

2 85 1040 1 1 0

3 103 1090 1 1 1

4 97 1084 1 0 0

5 80 1038 0 1 1

6 100 1080 1 1 0

7 83 1025 1 0 1

8 86 1030 1 0 0

9 101 1100 1 1 1

D1: Partitiontherangeofeachcontinuousattributeinto3equal-sizedbins.

D2: Partition the range of each continuous attribute into 3 bins, where each bin contains an equal number of transactions.

For each strategy, answer the following questions:

i. Construct a binarized version of the data set.

ii. Derive all the frequent itemsets having support ≥ 30%.

b. The continuous attribute can also be discretized using a clustering approach.

i. Plot a graph of temperature versus pressure for the data points shown in Table 6.11.

ii. How many natural clusters do you observe from the graph? Assign a label (C1, C2, etc.) to each cluster in the graph.

iii. What type of clustering algorithm do you think can be used to identify the clusters? State your reasons clearly.

iv. Replace the temperature and pressure attributes in Table 6.11 with asymmetric binary attributes C1, C2, etc. Construct a transaction matrix using the new attributes (along with attributes Alarm1, Alarm2, and Alarm3).

v. Derive all the frequent itemsets having support ≥ 30% from the binarized data.

3.ConsiderthedatasetshowninTable6.12 .Thefirstattributeiscontinuous,whiletheremainingtwoattributesareasymmetricbinary.Aruleisconsideredtobestrongifitssupportexceeds15%anditsconfidenceexceeds60%.ThedatagiveninTable6.12 supportsthefollowingtwostrongrules:

Table6.12.DatasetforExercise3 .

A B C

1 1 1

2 1 1

3 1 0

4 1 0

5 1 1

6 0 1

7 0 0

8 1 1

9 0 0

10 0 0

11 0 0

12 0 1

(i) {(1≤A≤2),B=1}→{C=1}

(ii) {(5≤A≤8),B=1}→{C=1}

a. Computethesupportandconfidenceforbothrules.

b. To find the rules using the traditional Apriori algorithm, we need to discretize the continuous attribute A. Suppose we apply the equal width binning approach to discretize the data, with bin-width = 2, 3, 4. For each bin-width, state whether the above two rules are discovered by the Apriori algorithm. (Note that the rules may not be in the same exact form as before because they may contain wider or narrower intervals for A.) For each rule that corresponds to one of the above two rules, compute its support and confidence.

c. Commentontheeffectivenessofusingtheequalwidthapproachforclassifyingtheabovedataset.Isthereabin-widththatallowsyoutofindbothrulessatisfactorily?Ifnot,whatalternativeapproachcanyoutaketoensurethatyouwillfindbothrules?

4.ConsiderthedatasetshowninTable6.13 .

Table6.13.DatasetforExercise4 .

Age(A) NumberofHoursOnlineperWeek(B)

0–5 5–10 10–20 20–30 30–40

10–15 2 3 5 3 2

15–25 2 5 10 10 3

25–35 10 15 5 3 2

35–50 4 6 5 3 2

a. For each combination of rules given below, specify the rule that has the highest confidence.

i. 15 < A < 25 → 10 < B < 20, 10 < A < 25 → 10 < B < 20, and 15 < A < 35 → 10 < B < 20.

ii. 15 < A < 25 → 10 < B < 20, 15 < A < 25 → 5 < B < 20, and 15 < A < 25 → 5 < B < 30.

iii. 15 < A < 25 → 10 < B < 20 and 10 < A < 35 → 5 < B < 30.

b. Suppose we are interested in finding the average number of hours spent online per week by Internet users between the age of 15 and 35. Write the corresponding statistics-based association rule to characterize the segment of users. To compute the average number of hours spent online, approximate each interval by its midpoint value (e.g., use B = 7.5 to represent the interval 5 < B < 10).

c. Testwhetherthequantitativeassociationrulegiveninpart(b)isstatisticallysignificantbycomparingitsmeanagainsttheaveragenumberofhoursspentonlinebyotheruserswhodonotbelongtotheagegroup.

5.Forthedatasetwiththeattributesgivenbelow,describehowyouwouldconvertitintoabinarytransactiondatasetappropriateforassociationanalysis.Specifically,indicateforeachattributeintheoriginaldataset

a. howmanybinaryattributesitwouldcorrespondtointhetransactiondataset,

b. howthevaluesoftheoriginalattributewouldbemappedtovaluesofthebinaryattributes,and

c. ifthereisanyhierarchicalstructureinthedatavaluesofanattributethatcouldbeusefulforgroupingthedataintofewerbinaryattributes.

Thefollowingisalistofattributesforthedatasetalongwiththeirpossiblevalues.Assumethatallattributesarecollectedonaper-studentbasis:


Year:Freshman,Sophomore,Junior,Senior,Graduate:Masters,Graduate:PhD,Professional

Zipcode:zipcodeforthehomeaddressofaU.S.student,zipcodeforthelocaladdressofanon-U.S.student

College:Agriculture,Architecture,ContinuingEducation,Education,LiberalArts,Engineering,NaturalSciences,Business,Law,Medical,Dentistry,Pharmacy,Nursing,VeterinaryMedicine

OnCampus:1ifthestudentlivesoncampus,0otherwise

Eachofthefollowingisaseparateattributethathasavalueof1ifthepersonspeaksthelanguageandavalueof0,otherwise.

–Arabic

–Bengali

–ChineseMandarin

–English

–Portuguese

–Russian

–Spanish

6.ConsiderthedatasetshowninTable6.14 .Supposeweareinterestedinextractingthefollowingassociationrule:

{α1 ≤ Age ≤ α2, Play Piano = Yes} → {Enjoy Classical Music = Yes}

Table 6.14. Data set for Exercise 6.

Age PlayPiano EnjoyClassicalMusic

9 Yes Yes

11 Yes Yes

14 Yes No

17 Yes No

19 Yes Yes

21 No No

25 No No

29 Yes Yes

33 No No

39 No Yes

41 No No

47 No Yes

Tohandlethecontinuousattribute,weapplytheequal-frequencyapproachwith3,4,and6intervals.Categoricalattributesarehandledbyintroducingasmanynewasymmetricbinaryattributesasthenumberofcategoricalvalues.Assumethatthesupportthresholdis10%andtheconfidencethresholdis70%.

a. Suppose we discretize the Age attribute into 3 equal-frequency intervals. Find a pair of values for α1 and α2 that satisfy the minimum support and minimum confidence requirements.

b. Repeat part (a) by discretizing the Age attribute into 4 equal-frequency intervals. Compare the extracted rules against the ones you had obtained in part (a).

c. Repeatpart(a)bydiscretizingtheAgeattributeinto6equal-frequencyintervals.Comparetheextractedrulesagainsttheonesyouhadobtainedinpart(a).

d. Fromtheresultsinpart(a),(b),and(c),discusshowthechoiceofdiscretizationintervalswillaffecttherulesextractedbyassociationruleminingalgorithms.

7.ConsiderthetransactionsshowninTable6.15 ,withanitemtaxonomygiveninFigure6.25 .

Table6.15.Exampleofmarketbaskettransactions.

TransactionID ItemsBought

1 Chips,Cookies,RegularSoda,Ham

2 Chips,Ham,BonelessChicken,DietSoda

3 Ham,Bacon,WholeChicken,RegularSoda

4 Chips,Ham,BonelessChicken,DietSoda

5 Chips,Bacon,BonelessChicken

6 Chips,Ham,Bacon,WholeChicken,RegularSoda

7 Chips,Cookies,BonelessChicken,DietSoda

a. Whatarethemainchallengesofminingassociationruleswithitemtaxonomy?

b. Consider the approach where each transaction t is replaced by an extended transaction t′ that contains all the items in t as well as their respective ancestors. For example, the transaction t = {…} will be replaced by t′ = {…}. Use this approach to derive all frequent itemsets (up to size 4) with support ≥ 70%.

c. Consider an alternative approach where the frequent itemsets are generated one level at a time. Initially, all the frequent itemsets involving items at the highest level of the hierarchy are generated. Next, we use the frequent itemsets discovered at the higher level of the hierarchy to generate candidate itemsets involving items at the lower levels of the hierarchy. For example, we generate the candidate itemset {…} only if {…} is frequent. Use this approach to derive all frequent itemsets (up to size 4) with support ≥ 70%.

d. Comparethefrequentitemsetsfoundinparts(b)and(c).Commentontheefficiencyandcompletenessofthealgorithms.

8.Thefollowingquestionsexaminehowthesupportandconfidenceofanassociationrulemayvaryinthepresenceofaconcepthierarchy.

a. Consider an item x in a given concept hierarchy. Let x̄1, x̄2, …, x̄k denote the k children of x in the concept hierarchy. Show that s(x) ≤ ∑_{i=1}^{k} s(x̄i), where s(⋅) is the support of an item. Under what conditions will the inequality become an equality?

b. Let p and q denote a pair of items, while p̂ and q̂ are their corresponding parents in the concept hierarchy. If s({p,q}) > minsup, which of the following itemsets are guaranteed to be frequent? (i) s({p̂,q}), (ii) s({p,q̂}), and (iii) s({p̂,q̂}).

c. Consider the association rule {p} → {q}. Suppose the confidence of the rule exceeds minconf. Which of the following rules are guaranteed to have confidence higher than minconf? (i) {p} → {q̂}, (ii) {p̂} → {q}, and (iii) {p̂} → {q̂}.

9.

a. List all the 4-subsequences contained in the following data sequence: ⟨{1,3} {2} {2,3} {4}⟩, assuming no timing constraints.

b. Listallthe3-elementsubsequencescontainedinthedatasequenceforpart(a)assumingthatnotimingconstraintsareimposed.

c. Listallthe4-subsequencescontainedinthedatasequenceforpart(a)(assumingthetimingconstraintsareflexible).

d. Listallthe3-elementsubsequencescontainedinthedatasequenceforpart(a)(assumingthetimingconstraintsareflexible).

10. Find all the frequent subsequences with support ≥ 50% given the sequence database shown in Table 6.16. Assume that there are no timing constraints imposed on the sequences.

Table6.16.Exampleofeventsequencesgeneratedbyvarioussensors.

Sensor Timestamp Events

S1 1 A,B

2 C

3 D,E

4 C

S2 1 A,B

2 C,D


3 E

S3 1 B

2 A

3 B

4 D,E

S4 1 C

2 D,E

3 C

4 E

S5 1 B

2 A

3 B,C

4 A,D

11.

a. For each of the sequences w = ⟨e1 e2 … ei … ei+1 … elast⟩ given below, determine whether they are subsequences of the sequence ⟨{1,2,3}{2,4}{2,4,5}{3,5}{6}⟩ subjected to the following timing constraints:

mingap = 0 (interval between last event in ei and first event in ei+1 is > 0)
maxgap = 3 (interval between first event in ei and last event in ei+1 is ≤ 3)
maxspan = 5 (interval between first event in e1 and last event in elast is ≤ 5)
ws = 1 (time between first and last events in ei is ≤ 1)

w = ⟨{1}{2}{3}⟩
w = ⟨{1,2,3,4}{5,6}⟩
w = ⟨{2,4}{2,4}{6}⟩
w = ⟨{1}{2,4}{6}⟩
w = ⟨{1,2}{3,4}{5,6}⟩

b. Determine whether each of the subsequences w given in the previous question are contiguous subsequences of the following sequences s:

s = ⟨{1,2,3,4,5,6}{1,2,3,4,5,6}{1,2,3,4,5,6}⟩
s = ⟨{1,2,3,4}{1,2,3,4,5,6}{3,4,5,6}⟩
s = ⟨{1,2}{1,2,3,4}{3,4,5,6}{5,6}⟩
s = ⟨{1,2,3}{2,3,4,5}{4,5,6}⟩

12. For each of the sequences w = ⟨e1,…,elast⟩ below, determine whether they are subsequences of the following data sequence:

⟨{A,B}{C,D}{A,B}{C,D}{A,B}{C,D}⟩

subjected to the following timing constraints:

mingap = 0 (interval between last event in ei and first event in ei+1 is > 0)
maxgap = 2 (interval between first event in ei and last event in ei+1 is ≤ 2)
maxspan = 6 (interval between first event in e1 and last event in elast is ≤ 6)
ws = 1 (time between first and last events in ei is ≤ 1)

a. w = ⟨{A}{B}{C}{D}⟩

b. w = ⟨{A}{B,C,D}{A}⟩

c. w = ⟨{A}{A,B,C,D}{A}⟩

d. w = ⟨{B,C}{A,D}{B,C}⟩

e. w = ⟨{A,B,C,D}{A,B,C,D}⟩

13. Consider the following frequent 3-sequences: ⟨{1,2,3}⟩, ⟨{1,2}{3}⟩, ⟨{1}{2,3}⟩, ⟨{1,2}{4}⟩, ⟨{1,3}{4}⟩, ⟨{1,2,4}⟩, ⟨{2,3}{3}⟩, ⟨{2,3}{4}⟩, ⟨{2}{3}{3}⟩, and ⟨{2}{3}{4}⟩.

a. List all the candidate 4-sequences produced by the candidate generation step of the GSP algorithm.

b. List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming no timing constraints).

c. List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming maxgap = 1).

14. Consider the data sequence shown in Table 6.17 for a given object. Count the number of occurrences for the sequence ⟨{p}{q}{r}⟩ according to the following counting methods:

Table 6.17. Example of event sequence data for Exercise 14.

Timestamp Events

1 p,q

2 r

3 s

4 p,q

5 r,s

6 p

7 q,r

8 q,s

9 p

10 q,r,s

a. COBJ(oneoccurrenceperobject).

b. CWIN(oneoccurrenceperslidingwindow).

c. CMINWIN(numberofminimalwindowsofoccurrence).

d. CDIST_O(distinctoccurrenceswithpossibilityofevent-timestampoverlap).

e. CDIST(distinctoccurrenceswithnoeventtimestampoverlapallowed).

15.Describethetypesofmodificationsnecessarytoadaptthefrequentsubgraphminingalgorithmtohandle:

a. Directedgraphs

b. Unlabeledgraphs

c. Acyclicgraphs

d. Disconnectedgraphs

Foreachtypeofgraphgivenabove,describewhichstepofthealgorithmwillbeaffected(candidategeneration,candidatepruning,andsupportcounting),andanyfurtheroptimizationthatcanhelpimprovetheefficiencyofthealgorithm.

16.DrawallcandidatesubgraphsobtainedfromjoiningthepairofgraphsshowninFigure6.28 .

Figure6.28.GraphsforExercise16 .

17.DrawallthecandidatesubgraphsobtainedbyjoiningthepairofgraphsshowninFigure6.29 .

Figure6.29.GraphsforExercise17 .

18. Show that the candidate generation procedure introduced in Section 6.5.3 for frequent subgraph mining is complete, i.e., no frequent k-subgraph can be missed from being generated if every pair of frequent (k−1)-subgraphs is considered for merging.

19.

a. If support is defined in terms of induced subgraph relationship, show that the confidence of the rule g1 → g2 can be greater than 1 if g1 and g2 are allowed to have overlapping vertex sets.

b. What is the time complexity needed to determine the canonical label of a graph that contains |V| vertices?

c. The core of a subgraph can have multiple automorphisms. This will increase the number of candidate subgraphs obtained after merging two

frequentsubgraphsthatsharethesamecore.Determinethemaximumnumberofcandidatesubgraphsobtainedduetoautomorphismofacoreofsizek.

d. Twofrequentsubgraphsofsizekmaysharemultiplecores.Determinethemaximumnumberofcoresthatcanbesharedbythetwofrequentsubgraphs.

20. (a) Consider the two graphs shown below.

Draw all the distinct cores obtained when merging the two subgraphs.

How many candidates are generated using the following core?

21. The original association rule mining framework considers only presence of items together in the same transaction. There are situations in which itemsets that are infrequent may also be informative. For instance, the itemset {TV, DVD, ¬VCR} suggests that many customers who buy TVs and DVDs do not buy VCRs.

In this problem, you are asked to extend the association rule framework to negative itemsets (i.e., itemsets that contain both presence and absence of items). We will use the negation symbol (¬) to refer to absence of items.

a. AnaïvewayforderivingnegativeitemsetsistoextendeachtransactiontoincludeabsenceofitemsasshowninTable6.18 .

Table6.18.Exampleofnumericdataset.

TID TV ¬TV DVD ¬DVD VCR ¬VCR …

1 1 0 0 1 0 1 …

2 1 0 0 1 0 1 …

i. Suppose the transaction database contains 1000 distinct items. What is the total number of positive itemsets that can be generated from these items? (Note: A positive itemset does not contain any negated items.)

ii. What is the maximum number of frequent itemsets that can be generated from these transactions? (Assume that a frequent itemset may contain positive, negative, or both types of items.)

iii. Explain why such a naïve method of extending each transaction with negative items is not practical for deriving negative itemsets.

b. Consider the database shown in Table 6.15. What are the support and confidence values for the following negative association rules involving regular and diet soda?

i. ¬Regular → Diet.

ii. Regular → ¬Diet.

iii. ¬Diet → Regular.

iv. Diet → ¬Regular.

22.Supposewewouldliketoextractpositiveandnegativeitemsetsfromadatasetthatcontainsditems.

a. Consideranapproachwhereweintroduceanewvariabletorepresenteachnegativeitem.Withthisapproach,thenumberofitemsgrowsfromdto2d.Whatisthetotalsizeoftheitemsetlattice,assumingthatanitemsetmaycontainbothpositiveandnegativeitemsofthesamevariable?

b. Assume that an itemset must contain positive or negative items of different variables. For example, the itemset {a, ā, b, c̄} is invalid because it contains both positive and negative items for variable a. What is the total size of the itemset lattice?

23.Foreachtypeofpatterndefinedbelow,determinewhetherthesupportmeasureismonotone,anti-monotone,ornon-monotone(i.e.,neithermonotonenoranti-monotone)withrespecttoincreasingitemsetsize.

a. Itemsets that contain both positive and negative items such as {a, b, c̄, d̄}. Is the support measure monotone, anti-monotone, or non-monotone when applied to such patterns?

b. Boolean logical patterns such as {(a ∨ b ∨ c), d, e}, which may contain both disjunctions and conjunctions of items. Is the support measure monotone, anti-monotone, or non-monotone when applied to such patterns?

24.ManyassociationanalysisalgorithmsrelyonanApriori-likeapproachforfindingfrequentpatterns.Theoverallstructureofthealgorithmisgivenbelow.

Suppose we are interested in finding Boolean logical rules such as {a ∨ b} → {c, d}, which may contain both disjunctions and conjunctions of items. The corresponding itemset can be written as {(a ∨ b), c, d}.

a. DoestheAprioriprinciplestillholdforsuchitemsets?

b. Howshouldthecandidategenerationstepbemodifiedtofindsuchpatterns?

c. Howshouldthecandidatepruningstepbemodifiedtofindsuchpatterns?

d. Howshouldthesupportcountingstepbemodifiedtofindsuchpatterns?

Algorithm 6.5 Apriori-like algorithm.
1. k = 1.
2. Fk = {i | i ∈ I ∧ σ({i})/N ≥ minsup}. {Find frequent 1-patterns.}
3. repeat
4.   k = k + 1.
5.   Ck = genCandidate(Fk−1). {Candidate Generation}
6.   Ck = pruneCandidate(Ck, Fk−1). {Candidate Pruning}
7.   Ck = count(Ck, D). {Support Counting}
8.   Fk = {c | c ∈ Ck ∧ σ(c)/N ≥ minsup}. {Extract frequent patterns}
9. until Fk = ∅
10. Answer = ∪ Fk.

7ClusterAnalysis:BasicConceptsandAlgorithms

Clusteranalysisdividesdataintogroups(clusters)thataremeaningful,useful,orboth.Ifmeaningfulgroupsarethegoal,thentheclustersshouldcapturethenaturalstructureofthedata.Insomecases,however,clusteranalysisisusedfordatasummarizationinordertoreducethesizeofthedata.Whetherforunderstandingorutility,clusteranalysishaslongplayedanimportantroleinawidevarietyoffields:psychologyandothersocialsciences,biology,statistics,patternrecognition,informationretrieval,machinelearning,anddatamining.

Therehavebeenmanyapplicationsofclusteranalysistopracticalproblems.Weprovidesomespecificexamples,organizedbywhetherthepurposeoftheclusteringisunderstandingorutility.

ClusteringforUnderstandingClasses,orconceptuallymeaningfulgroupsofobjectsthatsharecommoncharacteristics,playanimportantroleinhowpeopleanalyzeanddescribetheworld.Indeed,humanbeingsareskilledat

dividingobjectsintogroups(clustering)andassigningparticularobjectstothesegroups(classification).Forexample,evenrelativelyyoungchildrencanquicklylabeltheobjectsinaphotograph.Inthecontextofunderstandingdata,clustersarepotentialclassesandclusteranalysisisthestudyoftechniquesforautomaticallyfindingclasses.Thefollowingaresomeexamples:

Biology.Biologistshavespentmanyyearscreatingataxonomy(hierarchicalclassification)ofalllivingthings:kingdom,phylum,class,order,family,genus,andspecies.Thus,itisperhapsnotsurprisingthatmuchoftheearlyworkinclusteranalysissoughttocreateadisciplineofmathematicaltaxonomythatcouldautomaticallyfindsuchclassificationstructures.Morerecently,biologistshaveappliedclusteringtoanalyzethelargeamountsofgeneticinformationthatarenowavailable.Forexample,clusteringhasbeenusedtofindgroupsofgenesthathavesimilarfunctions.InformationRetrieval.TheWorldWideWebconsistsofbillionsofwebpages,andtheresultsofaquerytoasearchenginecanreturnthousandsofpages.Clusteringcanbeusedtogroupthesesearchresultsintoasmallnumberofclusters,eachofwhichcapturesaparticularaspectofthequery.Forinstance,aqueryof“movie”mightreturnwebpagesgroupedintocategoriessuchasreviews,trailers,stars,andtheaters.Eachcategory(cluster)canbebrokenintosubcategories(subclusters),producingahierarchicalstructurethatfurtherassistsauser’sexplorationofthequeryresults.Climate.UnderstandingtheEarth’sclimaterequiresfindingpatternsintheatmosphereandocean.Tothatend,clusteranalysishasbeenappliedtofindpatternsinatmosphericpressureandoceantemperaturethathaveasignificantimpactonclimate.PsychologyandMedicine.Anillnessorconditionfrequentlyhasanumberofvariations,andclusteranalysiscanbeusedtoidentifythesedifferentsubcategories.Forexample,clusteringhasbeenusedtoidentify

differenttypesofdepression.Clusteranalysiscanalsobeusedtodetectpatternsinthespatialortemporaldistributionofadisease.Business.Businessescollectlargeamountsofinformationaboutcurrentandpotentialcustomers.Clusteringcanbeusedtosegmentcustomersintoasmallnumberofgroupsforadditionalanalysisandmarketingactivities.

ClusteringforUtilityClusteranalysisprovidesanabstractionfromindividualdataobjectstotheclustersinwhichthosedataobjectsreside.Additionally,someclusteringtechniquescharacterizeeachclusterintermsofaclusterprototype;i.e.,adataobjectthatisrepresentativeoftheobjectsinthecluster.Theseclusterprototypescanbeusedasthebasisforanumberofadditionaldataanalysisordataprocessingtechniques.Therefore,inthecontextofutility,clusteranalysisisthestudyoftechniquesforfindingthemostrepresentativeclusterprototypes.

Summarization. Many data analysis techniques, such as regression or principal component analysis, have a time or space complexity of O(m²) or higher (where m is the number of objects), and thus, are not practical for large data sets. However, instead of applying the algorithm to the entire data set, it can be applied to a reduced data set consisting only of cluster prototypes. Depending on the type of analysis, the number of prototypes, and the accuracy with which the prototypes represent the data, the results can be comparable to those that would have been obtained if all the data could have been used.

Compression. Cluster prototypes can also be used for data compression. In particular, a table is created that consists of the prototypes for each cluster; i.e., each prototype is assigned an integer value that is its position (index) in the table. Each object is represented by the index of the prototype associated with its cluster. This type of compression is known as vector quantization and is often applied to image, sound, and video data,

where(1)manyofthedataobjectsarehighlysimilartooneanother,(2)somelossofinformationisacceptable,and(3)asubstantialreductioninthedatasizeisdesired.EfficientlyFindingNearestNeighbors.Findingnearestneighborscanrequirecomputingthepairwisedistancebetweenallpoints.Oftenclustersandtheirclusterprototypescanbefoundmuchmoreefficiently.Ifobjectsarerelativelyclosetotheprototypeoftheircluster,thenwecanusetheprototypestoreducethenumberofdistancecomputationsthatarenecessarytofindthenearestneighborsofanobject.Intuitively,iftwoclusterprototypesarefarapart,thentheobjectsinthecorrespondingclusterscannotbenearestneighborsofeachother.Consequently,tofindanobject’snearestneighbors,itisnecessarytocomputeonlythedistancetoobjectsinnearbyclusters,wherethenearnessoftwoclustersismeasuredbythedistancebetweentheirprototypes.ThisideaismademorepreciseinExercise25 ofChapter2 ,whichisonpage111.
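As a concrete illustration of the compression idea just described, the following sketch encodes each object by the index of its nearest cluster prototype and then reconstructs an approximation of the data from the prototype table. It assumes NumPy and a centroid array produced by some clustering algorithm; the function and variable names are ours, not from the text.

import numpy as np

def vector_quantize(points, centroids):
    # Encode each object by the index of its nearest prototype (the codebook).
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)          # one small integer code per object

def reconstruct(codes, centroids):
    # Approximate the original data from the prototype table and the codes.
    return centroids[codes]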

Thischapterprovidesanintroductiontoclusteranalysis.Webeginwithahigh-leveloverviewofclustering,includingadiscussionofthevariousapproachestodividingobjectsintosetsofclustersandthedifferenttypesofclusters.Wethendescribethreespecificclusteringtechniquesthatrepresentbroadcategoriesofalgorithmsandillustrateavarietyofconcepts:K-means,agglomerativehierarchicalclustering,andDBSCAN.Thefinalsectionofthischapterisdevotedtoclustervalidity—methodsforevaluatingthegoodnessoftheclustersproducedbyaclusteringalgorithm.MoreadvancedclusteringconceptsandalgorithmswillbediscussedinChapter8 .Wheneverpossible,wediscussthestrengthsandweaknessesofdifferentschemes.Inaddition,theBibliographicNotesprovidereferencestorelevantbooksandpapersthatexploreclusteranalysisingreaterdepth.

7.1OverviewBeforediscussingspecificclusteringtechniques,weprovidesomenecessarybackground.First,wefurtherdefineclusteranalysis,illustratingwhyitisdifficultandexplainingitsrelationshiptoothertechniquesthatgroupdata.Thenweexploretwoimportanttopics:(1)differentwaystogroupasetofobjectsintoasetofclusters,and(2)typesofclusters.

7.1.1WhatIsClusterAnalysis?

Clusteranalysisgroupsdataobjectsbasedoninformationfoundonlyinthedatathatdescribestheobjectsandtheirrelationships.Thegoalisthattheobjectswithinagroupbesimilar(orrelated)tooneanotheranddifferentfrom(orunrelatedto)theobjectsinothergroups.Thegreaterthesimilarity(orhomogeneity)withinagroupandthegreaterthedifferencebetweengroups,thebetterormoredistincttheclustering.

Inmanyapplications,thenotionofaclusterisnotwelldefined.Tobetterunderstandthedifficultyofdecidingwhatconstitutesacluster,considerFigure7.1 ,whichshows20pointsandthreedifferentwaysofdividingthemintoclusters.Theshapesofthemarkersindicateclustermembership.Figures7.1(b) and7.1(d) dividethedataintotwoandsixparts,respectively.However,theapparentdivisionofeachofthetwolargerclustersintothreesubclustersmaysimplybeanartifactofthehumanvisualsystem.Also,itmaynotbeunreasonabletosaythatthepointsformfourclusters,asshowninFigure7.1(c) .Thisfigureillustratesthatthedefinitionofacluster

isimpreciseandthatthebestdefinitiondependsonthenatureofdataandthedesiredresults.

Figure7.1.Threedifferentwaysofclusteringthesamesetofpoints.

Clusteranalysisisrelatedtoothertechniquesthatareusedtodividedataobjectsintogroups.Forinstance,clusteringcanberegardedasaformofclassificationinthatitcreatesalabelingofobjectswithclass(cluster)labels.However,itderivestheselabelsonlyfromthedata.Incontrast,classificationinthesenseofChapter3 issupervisedclassification;i.e.,new,unlabeledobjectsareassignedaclasslabelusingamodeldevelopedfromobjectswithknownclasslabels.Forthisreason,clusteranalysisissometimesreferredtoasunsupervisedclassification.Whenthetermclassificationisusedwithoutanyqualificationwithindatamining,ittypicallyreferstosupervisedclassification.

Also,whilethetermssegmentationandpartitioningaresometimesusedassynonymsforclustering,thesetermsarefrequentlyusedforapproachesoutsidethetraditionalboundsofclusteranalysis.Forexample,theterm

partitioning is often used in connection with techniques that divide graphs into subgraphs and that are not strongly connected to clustering. Segmentation often refers to the division of data into groups using simple techniques; e.g., an image can be split into segments based only on pixel intensity and color, or people can be divided into groups based on their income. Nonetheless, some work in graph partitioning and in image and market segmentation is related to cluster analysis.

7.1.2DifferentTypesofClusterings

Anentirecollectionofclustersiscommonlyreferredtoasaclustering,andinthissection,wedistinguishvarioustypesofclusterings:hierarchical(nested)versuspartitional(unnested),exclusiveversusoverlappingversusfuzzy,andcompleteversuspartial.

HierarchicalversusPartitional

Themostcommonlydiscusseddistinctionamongdifferenttypesofclusteringsiswhetherthesetofclustersisnestedorunnested,orinmoretraditionalterminology,hierarchicalorpartitional.Apartitionalclusteringissimplyadivisionofthesetofdataobjectsintonon-overlappingsubsets(clusters)suchthateachdataobjectisinexactlyonesubset.Takenindividually,eachcollectionofclustersinFigures7.1(b–d) isapartitionalclustering.

Ifwepermitclusterstohavesubclusters,thenweobtainahierarchicalclustering,whichisasetofnestedclustersthatareorganizedasatree.Eachnode(cluster)inthetree(exceptfortheleafnodes)istheunionofitschildren(subclusters),andtherootofthetreeistheclustercontainingalltheobjects.Often,butnotalways,theleavesofthetreearesingletonclustersof

individualdataobjects.Ifweallowclusterstobenested,thenoneinterpretationofFigure7.1(a) isthatithastwosubclusters(Figure7.1(b) ),eachofwhich,inturn,hasthreesubclusters(Figure7.1(d) ).TheclustersshowninFigures7.1(a–d) ,whentakeninthatorder,alsoformahierarchical(nested)clusteringwith,respectively,1,2,4,and6clustersoneachlevel.Finally,notethatahierarchicalclusteringcanbeviewedasasequenceofpartitionalclusteringsandapartitionalclusteringcanbeobtainedbytakinganymemberofthatsequence;i.e.,bycuttingthehierarchicaltreeataparticularlevel.

ExclusiveversusOverlappingversusFuzzy

TheclusteringsshowninFigure7.1 areallexclusive,astheyassigneachobjecttoasinglecluster.Therearemanysituationsinwhichapointcouldreasonablybeplacedinmorethanonecluster,andthesesituationsarebetteraddressedbynon-exclusiveclustering.Inthemostgeneralsense,anoverlappingornon-exclusiveclusteringisusedtoreflectthefactthatanobjectcansimultaneouslybelongtomorethanonegroup(class).Forinstance,apersonatauniversitycanbebothanenrolledstudentandanemployeeoftheuniversity.Anon-exclusiveclusteringisalsooftenusedwhen,forexample,anobjectis“between”twoormoreclustersandcouldreasonablybeassignedtoanyoftheseclusters.ImagineapointhalfwaybetweentwooftheclustersofFigure7.1 .Ratherthanmakeasomewhatarbitraryassignmentoftheobjecttoasinglecluster,itisplacedinallofthe“equallygood”clusters.

Inafuzzyclustering(Section8.2.1 ),everyobjectbelongstoeveryclusterwithamembershipweightthatisbetween0(absolutelydoesn’tbelong)and1(absolutelybelongs).Inotherwords,clustersaretreatedasfuzzysets.(Mathematically,afuzzysetisoneinwhichanobjectbelongstoeverysetwithaweightthatisbetween0and1.Infuzzyclustering,weoftenimposethe

additionalconstraintthatthesumoftheweightsforeachobjectmustequal1.)Similarly,probabilisticclusteringtechniques(Section8.2.2 )computetheprobabilitywithwhicheachpointbelongstoeachcluster,andtheseprobabilitiesmustalsosumto1.Becausethemembershipweightsorprobabilitiesforanyobjectsumto1,afuzzyorprobabilisticclusteringdoesnotaddresstruemulticlasssituations,suchasthecaseofastudentemployee,whereanobjectbelongstomultipleclasses.Instead,theseapproachesaremostappropriateforavoidingthearbitrarinessofassigninganobjecttoonlyoneclusterwhenitisclosetoseveral.Inpractice,afuzzyorprobabilisticclusteringisoftenconvertedtoanexclusiveclusteringbyassigningeachobjecttotheclusterinwhichitsmembershipweightorprobabilityishighest.

CompleteversusPartial

Acompleteclusteringassignseveryobjecttoacluster,whereasapartialclusteringdoesnot.Themotivationforapartialclusteringisthatsomeobjectsinadatasetmaynotbelongtowell-definedgroups.Manytimesobjectsinthedatasetrepresentnoise,outliers,or“uninterestingbackground.”Forexample,somenewspaperstoriesshareacommontheme,suchasglobalwarming,whileotherstoriesaremoregenericorone-of-a-kind.Thus,tofindtheimportanttopicsinlastmonth’sstories,weoftenwanttosearchonlyforclustersofdocumentsthataretightlyrelatedbyacommontheme.Inothercases,acompleteclusteringoftheobjectsisdesired.Forexample,anapplicationthatusesclusteringtoorganizedocumentsforbrowsingneedstoguaranteethatalldocumentscanbebrowsed.


7.1.3DifferentTypesofClusters

Clusteringaimstofindusefulgroupsofobjects(clusters),whereusefulnessisdefinedbythegoalsofthedataanalysis.Notsurprisingly,severaldifferentnotionsofaclusterproveusefulinpractice.Inordertovisuallyillustratethedifferencesamongthesetypesofclusters,weusetwo-dimensionalpoints,asshowninFigure7.2 ,asourdataobjects.Westress,however,thatthetypesofclustersdescribedhereareequallyvalidforotherkindsofdata.

Figure7.2.Differenttypesofclustersasillustratedbysetsoftwo-dimensionalpoints.

Well-Separated

Aclusterisasetofobjectsinwhicheachobjectiscloser(ormoresimilar)toeveryotherobjectintheclusterthantoanyobjectnotinthecluster.Sometimesathresholdisusedtospecifythatalltheobjectsinaclustermustbesufficientlyclose(orsimilar)tooneanother.Thisidealisticdefinitionofaclusterissatisfiedonlywhenthedatacontainsnaturalclustersthatarequitefarfromeachother.Figure7.2(a) givesanexampleofwell-separatedclustersthatconsistsoftwogroupsofpointsinatwo-dimensionalspace.Thedistancebetweenanytwopointsindifferentgroupsislargerthanthedistancebetweenanytwopointswithinagroup.Well-separatedclustersdonotneedtobeglobular,butcanhaveanyshape.

Prototype-Based

Aclusterisasetofobjectsinwhicheachobjectiscloser(moresimilar)totheprototypethatdefinestheclusterthantotheprototypeofanyothercluster.Fordatawithcontinuousattributes,theprototypeofaclusterisoftenacentroid,i.e.,theaverage(mean)ofallthepointsinthecluster.Whenacentroidisnotmeaningful,suchaswhenthedatahascategoricalattributes,theprototypeisoftenamedoid,i.e.,themostrepresentativepointofacluster.Formanytypesofdata,theprototypecanberegardedasthemostcentralpoint,andinsuchinstances,wecommonlyrefertoprototype-basedclustersascenter-basedclusters.Notsurprisingly,suchclusterstendtobeglobular.Figure7.2(b) showsanexampleofcenter-basedclusters.

Graph-Based

Ifthedataisrepresentedasagraph,wherethenodesareobjectsandthelinksrepresentconnectionsamongobjects(seeSection2.1.2 ),thenaclustercanbedefinedasaconnectedcomponent;i.e.,agroupofobjects

thatareconnectedtooneanother,butthathavenoconnectiontoobjectsoutsidethegroup.Animportantexampleofgraph-basedclustersisacontiguity-basedcluster,wheretwoobjectsareconnectedonlyiftheyarewithinaspecifieddistanceofeachother.Thisimpliesthateachobjectinacontiguity-basedclusterisclosertosomeotherobjectintheclusterthantoanypointinadifferentcluster.Figure7.2(c) showsanexampleofsuchclustersfortwo-dimensionalpoints.Thisdefinitionofaclusterisusefulwhenclustersareirregularorintertwined.However,thisapproachcanhavetroublewhennoiseispresentsince,asillustratedbythetwosphericalclustersofFigure7.2(c) ,asmallbridgeofpointscanmergetwodistinctclusters.

Othertypesofgraph-basedclustersarealsopossible.Onesuchapproach(Section7.3.2 )definesaclusterasaclique;i.e.,asetofnodesinagraphthatarecompletelyconnectedtoeachother.Specifically,ifweaddconnectionsbetweenobjectsintheorderoftheirdistancefromoneanother,aclusterisformedwhenasetofobjectsformsaclique.Likeprototype-basedclusters,suchclusterstendtobeglobular.

Density-Based

Aclusterisadenseregionofobjectsthatissurroundedbyaregionoflowdensity.Figure7.2(d) showssomedensity-basedclustersfordatacreatedbyaddingnoisetothedataofFigure7.2(c) .Thetwocircularclustersarenotmerged,asinFigure7.2(c) ,becausethebridgebetweenthemfadesintothenoise.Likewise,thecurvethatispresentinFigure7.2(c) alsofadesintothenoiseanddoesnotformaclusterinFigure7.2(d) .Adensity-baseddefinitionofaclusterisoftenemployedwhentheclustersareirregularorintertwined,andwhennoiseandoutliersarepresent.Bycontrast,acontiguity-baseddefinitionofaclusterwouldnotworkwellforthedataofFigure7.2(d) becausethenoisewouldtendtoformbridgesbetweenclusters.

Shared-Property(ConceptualClusters)

Moregenerally,wecandefineaclusterasasetofobjectsthatsharesomeproperty.Thisdefinitionencompassesallthepreviousdefinitionsofacluster;e.g.,objectsinacenter-basedclustersharethepropertythattheyareallclosesttothesamecentroidormedoid.However,theshared-propertyapproachalsoincludesnewtypesofclusters.ConsidertheclustersshowninFigure7.2(e) .Atriangulararea(cluster)isadjacenttoarectangularone,andtherearetwointertwinedcircles(clusters).Inbothcases,aclusteringalgorithmwouldneedaveryspecificconceptofaclustertosuccessfullydetecttheseclusters.Theprocessoffindingsuchclustersiscalledconceptualclustering.However,toosophisticatedanotionofaclusterwouldtakeusintotheareaofpatternrecognition,andthus,weonlyconsidersimplertypesofclustersinthisbook.

RoadMapInthischapter,weusethefollowingthreesimple,butimportanttechniquestointroducemanyoftheconceptsinvolvedinclusteranalysis.

K-means.Thisisaprototype-based,partitionalclusteringtechniquethatattemptstofindauser-specifiednumberofclusters(K),whicharerepresentedbytheircentroids.AgglomerativeHierarchicalClustering.Thisclusteringapproachreferstoacollectionofcloselyrelatedclusteringtechniquesthatproduceahierarchicalclusteringbystartingwitheachpointasasingletonclusterandthenrepeatedlymergingthetwoclosestclustersuntilasingle,all-encompassingclusterremains.Someofthesetechniqueshaveanaturalinterpretationintermsofgraph-basedclustering,whileothershaveaninterpretationintermsofaprototype-basedapproach.

DBSCAN.Thisisadensity-basedclusteringalgorithmthatproducesapartitionalclustering,inwhichthenumberofclustersisautomaticallydeterminedbythealgorithm.Pointsinlow-densityregionsareclassifiedasnoiseandomitted;thus,DBSCANdoesnotproduceacompleteclustering.

7.2K-meansPrototype-basedclusteringtechniquescreateaone-levelpartitioningofthedataobjects.Thereareanumberofsuchtechniques,buttwoofthemostprominentareK-meansandK-medoid.K-meansdefinesaprototypeintermsofacentroid,whichisusuallythemeanofagroupofpoints,andistypicallyappliedtoobjectsinacontinuousn-dimensionalspace.K-medoiddefinesaprototypeintermsofamedoid,whichisthemostrepresentativepointforagroupofpoints,andcanbeappliedtoawiderangeofdatasinceitrequiresonlyaproximitymeasureforapairofobjects.Whileacentroidalmostnevercorrespondstoanactualdatapoint,amedoid,byitsdefinition,mustbeanactualdatapoint.Inthissection,wewillfocussolelyonK-means,whichisoneoftheoldestandmostwidely-usedclusteringalgorithms.

7.2.1TheBasicK-meansAlgorithm

TheK-meansclusteringtechniqueissimple,andwebeginwithadescriptionofthebasicalgorithm.WefirstchooseKinitialcentroids,whereKisauser-specifiedparameter,namely,thenumberofclustersdesired.Eachpointisthenassignedtotheclosestcentroid,andeachcollectionofpointsassignedtoacentroidisacluster.Thecentroidofeachclusteristhenupdatedbasedonthepointsassignedtothecluster.Werepeattheassignmentandupdatestepsuntilnopointchangesclusters,orequivalently,untilthecentroidsremainthesame.
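A minimal sketch of this assign-and-update loop, assuming NumPy arrays, Euclidean distance, and a simple random initialization; empty clusters and the other practical issues discussed later in the section are ignored, and all names are illustrative rather than taken from the text.

import numpy as np

def basic_kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Choose K initial centroids (here: K randomly chosen data points).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids no longer change
            break
        centroids = new_centroids
    return labels, centroids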

K-meansisformallydescribedbyAlgorithm7.1. TheoperationofK-meansisillustratedinFigure7.3 ,whichshowshow,startingfromthree

centroids,thefinalclustersarefoundinfourassignment-updatesteps.IntheseandotherfiguresdisplayingK-meansclustering,eachsubfigureshows(1)thecentroidsatthestartoftheiterationand(2)theassignmentofthepointstothosecentroids.Thecentroidsareindicatedbythe“+”symbol;allpointsbelongingtothesameclusterhavethesamemarkershape.

Figure7.3.UsingtheK-meansalgorithmtofindthreeclustersinsampledata.

Algorithm 7.1 Basic K-means algorithm.
1: Select K points as initial centroids.
2: repeat
3:   Form K clusters by assigning each point to its closest centroid.
4:   Recompute the centroid of each cluster.
5: until Centroids do not change.

In the first step, shown in Figure 7.3(a), points are assigned to the initial centroids, which are all in the larger group of points. For this example, we use the mean as the centroid. After points are assigned to a centroid, the centroid

isthenupdated.Again,thefigureforeachstepshowsthecentroidatthebeginningofthestepandtheassignmentofpointstothosecentroids.Inthesecondstep,pointsareassignedtotheupdatedcentroids,andthecentroidsareupdatedagain.Insteps2,3,and4,whichareshowninFigures7.3(b) ,(c) ,and(d) ,respectively,twoofthecentroidsmovetothetwosmallgroupsofpointsatthebottomofthefigures.WhentheK-meansalgorithmterminatesinFigure7.3(d) ,becausenomorechangesoccur,thecentroidshaveidentifiedthenaturalgroupingsofpoints.

Foranumberofcombinationsofproximityfunctionsandtypesofcentroids,K-meansalwaysconvergestoasolution;i.e.,K-meansreachesastateinwhichnopointsareshiftingfromoneclustertoanother,andhence,thecentroidsdon’tchange.Becausemostoftheconvergenceoccursintheearlysteps,however,theconditiononline5ofAlgorithm7.1 isoftenreplacedbyaweakercondition,e.g.,repeatuntilonly1%ofthepointschangeclusters.

WeconsidereachofthestepsinthebasicK-meansalgorithminmoredetailandthenprovideananalysisofthealgorithm’sspaceandtimecomplexity.

Assigning Points to the Closest Centroid

To assign a point to the closest centroid, we need a proximity measure that quantifies the notion of "closest" for the specific data under consideration. Euclidean (L2) distance is often used for data points in Euclidean space, while cosine similarity is more appropriate for documents. However, several types of proximity measures can be appropriate for a given type of data. For example, Manhattan (L1) distance can be used for Euclidean data, while the Jaccard measure is often employed for documents.

Usually, the similarity measures used for K-means are relatively simple since the algorithm repeatedly calculates the similarity of each point to each

centroid.Insomecases,however,suchaswhenthedataisinlow-dimensionalEuclideanspace,itispossibletoavoidcomputingmanyofthesimilarities,thussignificantlyspeedinguptheK-meansalgorithm.BisectingK-means(describedinSection7.2.3 )isanotherapproachthatspeedsupK-meansbyreducingthenumberofsimilaritiescomputed.

CentroidsandObjectiveFunctionsStep4oftheK-meansalgorithmwasstatedrathergenerallyas“recomputethecentroidofeachcluster,”sincethecentroidcanvary,dependingontheproximitymeasureforthedataandthegoaloftheclustering.Thegoaloftheclusteringistypicallyexpressedbyanobjectivefunctionthatdependsontheproximitiesofthepointstooneanotherortotheclustercentroids;e.g.,minimizethesquareddistanceofeachpointtoitsclosestcentroid.Weillustratethiswithtwoexamples.However,thekeypointisthis:afterwehavespecifiedaproximitymeasureandanobjectivefunction,thecentroidthatweshouldchoosecanoftenbedeterminedmathematically.WeprovidemathematicaldetailsinSection7.2.6 ,andprovideanon-mathematicaldiscussionofthisobservationhere.

DatainEuclideanSpace

ConsiderdatawhoseproximitymeasureisEuclideandistance.Forourobjectivefunction,whichmeasuresthequalityofaclustering,weusethesumofthesquarederror(SSE),whichisalsoknownasscatter.Inotherwords,wecalculatetheerrorofeachdatapoint,i.e.,itsEuclideandistancetotheclosestcentroid,andthencomputethetotalsumofthesquarederrors.GiventwodifferentsetsofclustersthatareproducedbytwodifferentrunsofK-means,weprefertheonewiththesmallestsquarederrorsincethismeansthattheprototypes(centroids)ofthisclusteringareabetterrepresentationof

thepointsintheircluster.UsingthenotationinTable7.1 ,theSSEisformallydefinedasfollows:

SSE = ∑_{i=1}^{K} ∑_{x ∈ Ci} dist(ci, x)²        (7.1)

where dist is the standard Euclidean (L2) distance between two objects in Euclidean space.

Table 7.1. Table of notation.

Symbol  Description
x       An object.
Ci      The ith cluster.
ci      The centroid of cluster Ci.
c       The centroid of all points.
mi      The number of objects in the ith cluster.
m       The number of objects in the data set.
K       The number of clusters.

Given these assumptions, it can be shown (see Section 7.2.6) that the centroid that minimizes the SSE of the cluster is the mean. Using the notation in Table 7.1, the centroid (mean) of the ith cluster is defined by Equation 7.2.

ci = (1/mi) ∑_{x ∈ Ci} x        (7.2)

To illustrate, the centroid of a cluster containing the three two-dimensional points, (1,1), (2,3), and (6,2), is ((1 + 2 + 6)/3, (1 + 3 + 2)/3) = (3, 2).

Steps3and4oftheK-meansalgorithmdirectlyattempttominimizetheSSE(ormoregenerally,theobjectivefunction).Step3formsclustersbyassigningpointstotheirnearestcentroid,whichminimizestheSSEforthegivensetofcentroids.Step4recomputesthecentroidssoastofurtherminimizetheSSE.However,theactionsofK-meansinSteps3and4areguaranteedtoonlyfindalocalminimumwithrespecttotheSSEbecausetheyarebasedonoptimizingtheSSEforspecificchoicesofthecentroidsandclusters,ratherthanforallpossiblechoices.Wewilllaterseeanexampleinwhichthisleadstoasuboptimalclustering.
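Equation 7.1 translates almost directly into code. A small sketch, assuming NumPy and the labels and centroids produced by a K-means run (the names are ours):

import numpy as np

def total_sse(points, labels, centroids):
    # Squared Euclidean distance of each point to its assigned centroid, summed (Eq. 7.1).
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())

Comparing this value across runs started from different initial centroids is how the "prefer the clustering with the smallest squared error" criterion above is applied in practice.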

DocumentData

To illustrate that K-means is not restricted to data in Euclidean space, we consider document data and the cosine similarity measure. Here we assume that the document data is represented as a document-term matrix as described on page 37. Our objective is to maximize the similarity of the documents in a cluster to the cluster centroid; this quantity is known as the cohesion of the cluster. For this objective it can be shown that the cluster centroid is, as for Euclidean data, the mean. The analogous quantity to the total SSE is the total cohesion, which is given by Equation 7.3.

Total Cohesion = ∑_{i=1}^{K} ∑_{x ∈ Ci} cosine(x, ci)        (7.3)

TheGeneralCase

Thereareanumberofchoicesfortheproximityfunction,centroid,andobjectivefunctionthatcanbeusedinthebasicK-meansalgorithmandthatareguaranteedtoconverge.Table7.2 showssomepossiblechoices,


includingthetwothatwehavejustdiscussed.NoticethatforManhattandistanceandtheobjectiveofminimizingthesumofthedistances,theappropriatecentroidisthemedianofthepointsinacluster.

Table7.2.K-means:Commonchoicesforproximity,centroids,andobjectivefunctions.

Proximity Function     Centroid     Objective Function
Manhattan (L1)         median       Minimize sum of the L1 distance of an object to its cluster centroid
Squared Euclidean (L2²) mean        Minimize sum of the squared L2 distance of an object to its cluster centroid
cosine                 mean         Maximize sum of the cosine similarity of an object to its cluster centroid
Bregman divergence     mean         Minimize sum of the Bregman divergence of an object to its cluster centroid

The last entry in the table, Bregman divergence (Section 2.4.8), is actually a class of proximity measures that includes the squared Euclidean distance, L2², the Mahalanobis distance, and cosine similarity. The importance of Bregman divergence functions is that any such function can be used as the basis of a K-means style clustering algorithm with the mean as the centroid. Specifically, if we use a Bregman divergence as our proximity function, then the resulting clustering algorithm has the usual properties of K-means with respect to convergence, local minima, etc. Furthermore, the properties of such a clustering algorithm can be developed for all possible Bregman divergences. For example, K-means algorithms that use cosine similarity or squared Euclidean distance are particular instances of a general clustering algorithm based on Bregman divergences.

FortherestofourK-meansdiscussion,weusetwo-dimensionaldatasinceitiseasytoexplainK-meansanditspropertiesforthistypeofdata.But,assuggestedbythelastfewparagraphs,K-meansisageneralclusteringalgorithmandcanbeusedwithawidevarietyofdatatypes,suchasdocumentsandtimeseries.

ChoosingInitialCentroidsWhenrandominitializationofcentroidsisused,differentrunsofK-meanstypicallyproducedifferenttotalSSEs.Weillustratethiswiththesetoftwo-dimensionalpointsshowninFigure7.3 ,whichhasthreenaturalclustersofpoints.Figure7.4(a) showsaclusteringsolutionthatistheglobalminimumoftheSSEforthreeclusters,whileFigure7.4(b) showsasuboptimalclusteringthatisonlyalocalminimum.

Figure7.4.Threeoptimalandnon-optimalclusters.

ChoosingtheproperinitialcentroidsisthekeystepofthebasicK-meansprocedure.Acommonapproachistochoosetheinitialcentroidsrandomly,buttheresultingclustersareoftenpoor.

Example7.1(PoorInitialCentroids).Randomlyselectedinitialcentroidscanbepoor.WeprovideanexampleofthisusingthesamedatasetusedinFigures7.3 and7.4 .Figures7.3 and7.5 showtheclustersthatresultfromtwoparticularchoicesofinitialcentroids.(Forbothfigures,thepositionsoftheclustercentroidsinthevariousiterationsareindicatedbycrosses.)InFigure7.3 ,eventhoughalltheinitialcentroidsarefromonenaturalcluster,theminimumSSEclusteringisstillfound.InFigure7.5 ,however,eventhoughtheinitialcentroidsseemtobebetterdistributed,weobtainasuboptimalclustering,withhighersquarederror.

Figure7.5.PoorstartingcentroidsforK-means.

Example7.2(LimitsofRandomInitialization).Onetechniquethatiscommonlyusedtoaddresstheproblemofchoosinginitialcentroidsistoperformmultipleruns,eachwithadifferentsetofrandomlychoseninitialcentroids,andthenselectthesetofclusterswiththeminimumSSE.Whilesimple,thisstrategymightnotworkverywell,dependingonthedatasetandthenumberofclusterssought.We

demonstratethisusingthesampledatasetshowninFigure7.6(a) .Thedataconsistsoftwopairsofclusters,wheretheclustersineach(top-bottom)pairareclosertoeachotherthantotheclustersintheotherpair.Figure7.6(b–d) showsthatifwestartwithtwoinitialcentroidsperpairofclusters,thenevenwhenbothcentroidsareinasinglecluster,thecentroidswillredistributethemselvessothatthe“true”clustersarefound.However,Figure7.7 showsthatifapairofclustershasonlyoneinitialcentroidandtheotherpairhasthree,thentwoofthetrueclusterswillbecombinedandonetrueclusterwillbesplit.

Figure7.6.Twopairsofclusterswithapairofinitialcentroidswithineachpairofclusters.

Figure7.7.Twopairsofclusterswithmoreorfewerthantwoinitialcentroidswithinapairofclusters.

Notethatanoptimalclusteringwillbeobtainedaslongastwoinitialcentroidsfallanywhereinapairofclusters,sincethecentroidswillredistributethemselves,onetoeachcluster.Unfortunately,asthenumberofclustersbecomeslarger,itisincreasinglylikelythatatleastonepairofclusterswillhaveonlyoneinitialcentroid—seeExercise4 onpage603.

Inthiscase,becausethepairsofclustersarefartherapartthanclusterswithinapair,theK-meansalgorithmwillnotredistributethecentroidsbetweenpairsofclusters,andthus,onlyalocalminimumwillbeachieved.

Becauseoftheproblemswithusingrandomlyselectedinitialcentroids,whichevenrepeatedrunsmightnotovercome,othertechniquesareoftenemployedforinitialization.Oneeffectiveapproachistotakeasampleofpointsandclusterthemusingahierarchicalclusteringtechnique.Kclustersareextractedfromthehierarchicalclustering,andthecentroidsofthoseclustersareusedastheinitialcentroids.Thisapproachoftenworkswell,butispracticalonlyif(1)thesampleisrelativelysmall,e.g.,afewhundredtoafewthousand(hierarchicalclusteringisexpensive),and(2)Kisrelativelysmallcomparedtothesamplesize.

Thefollowingprocedureisanotherapproachtoselectinginitialcentroids.Selectthefirstpointatrandomortakethecentroidofallpoints.Then,foreachsuccessiveinitialcentroid,selectthepointthatisfarthestfromanyoftheinitialcentroidsalreadyselected.Inthisway,weobtainasetofinitialcentroidsthatisguaranteedtobenotonlyrandomlyselectedbutalsowellseparated.Unfortunately,suchanapproachcanselectoutliers,ratherthanpointsindenseregions(clusters),whichcanleadtoasituationwheremanyclustershavejustonepoint—anoutlier—whichreducesthenumberofcentroidsforformingclustersforthemajorityofpoints.Also,itisexpensivetocomputethefarthestpointfromthecurrentsetofinitialcentroids.Toovercometheseproblems,thisapproachisoftenappliedtoasampleofthepoints.Becauseoutliersarerare,theytendnottoshowupinarandomsample.Incontrast,pointsfromeverydenseregionarelikelytobeincludedunlessthesamplesizeisverysmall.Also,thecomputationinvolvedinfindingtheinitialcentroidsisgreatlyreducedbecausethesamplesizeistypicallymuchsmallerthanthenumberofpoints.
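A sketch of this farthest-point selection, applied to a random sample as suggested above; it assumes NumPy, and the sample size and the choice of starting point are illustrative, not prescriptions from the text.

import numpy as np

def farthest_point_centroids(points, k, sample_size=500, seed=0):
    rng = np.random.default_rng(seed)
    # Work on a random sample so that rare outliers are unlikely to be included.
    idx = rng.choice(len(points), size=min(sample_size, len(points)), replace=False)
    sample = points[idx]
    centroids = [sample.mean(axis=0)]            # start from the centroid of the sampled points
    while len(centroids) < k:
        # Distance of every sampled point to its closest centroid chosen so far.
        d2 = np.min([((sample - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(sample[np.argmax(d2)])  # the farthest point becomes the next centroid
    return np.array(centroids)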

K-means++

More recently, a new approach for initializing K-means, called K-means++, has been developed. This procedure is guaranteed to find a K-means clustering solution that is optimal to within a factor of O(log(k)), which in practice translates into noticeably better clustering results in terms of lower SSE. This technique is similar to the idea just discussed of picking the first centroid at random and then picking each remaining centroid as the point that is as far as possible from the centroids already selected. Specifically, K-means++ picks centroids incrementally until k centroids have been picked. At every such step, each point has a probability of being picked as the new centroid that is proportional to the square of its distance to its closest centroid. It might seem that this approach might tend to choose outliers for centroids, but because outliers are rare, by definition, this is unlikely.

ThedetailsofK-means++initializationaregivenbyAlgorithm7.2. TherestofthealgorithmisthesameasordinaryK-means.

Algorithm 7.2 K-means++ initialization algorithm.

1: For the first centroid, pick one of the points at random.
2: for i = 1 to number of trials do
3:   Compute the distance, d(x), of each point to its closest centroid.
4:   Assign each point a probability proportional to each point's d(x)^2.
5:   Pick new centroid from the remaining points using the weighted probabilities.
6: end for
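As a concrete illustration of Algorithm 7.2, the following minimal Python sketch (ours; the name kmeans_pp_init and the toy data are illustrative) picks each new centroid with probability proportional to the squared distance d(x)^2 to its closest centroid.

import math
import random

def kmeans_pp_init(points, k, seed=0):
    """K-means++ style initialization: weighted, incremental centroid picking."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]          # first centroid picked at random
    while len(centroids) < k:
        # d(x): distance of each point to its closest centroid chosen so far
        d = [min(math.dist(p, c) for c in centroids) for p in points]
        # probability of being picked is proportional to d(x)^2
        weights = [di * di for di in d]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

random.seed(2)
pts = [(random.gauss(mx, 0.3), random.gauss(my, 0.3))
       for mx, my in [(0, 0), (3, 0), (0, 3), (3, 3)] for _ in range(50)]
print(kmeans_pp_init(pts, k=4))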

Later,wewilldiscusstwootherapproachesthatarealsousefulforproducingbetter-quality(lowerSSE)clusterings:usingavariantofK-meansthatislesssusceptibletoinitializationproblems(bisectingK-means)andusingpostprocessingto“fixup”thesetofclustersproduced.K-means++couldbecombinedwitheitherapproach.

Time and Space Complexity The space requirements for K-means are modest because only the data points and centroids are stored. Specifically, the storage required is O((m + K)n), where m is the number of points and n is the number of attributes. The time requirements for K-means are also modest, basically linear in the number of data points. In particular, the time required is O(I × K × m × n), where I is the number of iterations required for convergence. As mentioned, I is often small and can usually be safely bounded, as most changes typically occur in the first few iterations. Therefore, K-means is linear in m, the number of points, and is efficient as well as simple provided that K, the number of clusters, is significantly less than m.

7.2.2K-means:AdditionalIssues

HandlingEmptyClustersOneoftheproblemswiththebasicK-meansalgorithmisthatemptyclusterscanbeobtainedifnopointsareallocatedtoaclusterduringtheassignmentstep.Ifthishappens,thenastrategyisneededtochooseareplacementcentroid,sinceotherwise,thesquarederrorwillbelargerthannecessary.Oneapproachistochoosethepointthatisfarthestawayfromanycurrentcentroid.Ifnothingelse,thiseliminatesthepointthatcurrentlycontributes


mosttothetotalsquarederror.(AK-means++approachcouldbeusedaswell.)AnotherapproachistochoosethereplacementcentroidatrandomfromtheclusterthathasthehighestSSE.ThiswilltypicallysplittheclusterandreducetheoverallSSEoftheclustering.Ifthereareseveralemptyclusters,thenthisprocesscanberepeatedseveraltimes.

OutliersWhenthesquarederrorcriterionisused,outlierscanundulyinfluencetheclustersthatarefound.Inparticular,whenoutliersarepresent,theresultingclustercentroids(prototypes)aretypicallynotasrepresentativeastheyotherwisewouldbeandthus,theSSEwillbehigher.Becauseofthis,itisoftenusefultodiscoveroutliersandeliminatethembeforehand.Itisimportant,however,toappreciatethattherearecertainclusteringapplicationsforwhichoutliersshouldnotbeeliminated.Whenclusteringisusedfordatacompression,everypointmustbeclustered,andinsomecases,suchasfinancialanalysis,apparentoutliers,e.g.,unusuallyprofitablecustomers,canbethemostinterestingpoints.

Anobviousissueishowtoidentifyoutliers.AnumberoftechniquesforidentifyingoutlierswillbediscussedinChapter9 .Ifweuseapproachesthatremoveoutliersbeforeclustering,weavoidclusteringpointsthatwillnotclusterwell.Alternatively,outlierscanalsobeidentifiedinapostprocessingstep.Forinstance,wecankeeptrackoftheSSEcontributedbyeachpoint,andeliminatethosepointswithunusuallyhighcontributions,especiallyovermultipleruns.Also,weoftenwanttoeliminatesmallclustersbecausetheyfrequentlyrepresentgroupsofoutliers.

ReducingtheSSEwithPostprocessing

AnobviouswaytoreducetheSSEistofindmoreclusters,i.e.,tousealargerK.However,inmanycases,wewouldliketoimprovetheSSE,butdon’twanttoincreasethenumberofclusters.ThisisoftenpossiblebecauseK-meanstypicallyconvergestoalocalminimum.Varioustechniquesareusedto“fixup”theresultingclustersinordertoproduceaclusteringthathaslowerSSE.ThestrategyistofocusonindividualclusterssincethetotalSSEissimplythesumoftheSSEcontributedbyeachcluster.(WewillusethetermstotalSSEandclusterSSE,respectively,toavoidanypotentialconfusion.)WecanchangethetotalSSEbyperformingvariousoperationsontheclusters,suchassplittingormergingclusters.Onecommonlyusedapproachistoemployalternateclustersplittingandmergingphases.Duringasplittingphase,clustersaredivided,whileduringamergingphase,clustersarecombined.Inthisway,itisoftenpossibletoescapelocalSSEminimaandstillproduceaclusteringsolutionwiththedesirednumberofclusters.Thefollowingaresometechniquesusedinthesplittingandmergingphases.

TwostrategiesthatdecreasethetotalSSEbyincreasingthenumberofclustersarethefollowing:

Splitacluster:TheclusterwiththelargestSSEisusuallychosen,butwecouldalsosplittheclusterwiththelargeststandarddeviationforoneparticularattribute.

Introduceanewclustercentroid:Oftenthepointthatisfarthestfromanyclustercenterischosen.WecaneasilydeterminethisifwekeeptrackoftheSSEcontributedbyeachpoint.AnotherapproachistochooserandomlyfromallpointsorfromthepointswiththehighestSSEwithrespecttotheirclosestcentroids.

Twostrategiesthatdecreasethenumberofclusters,whiletryingtominimizetheincreaseintotalSSE,arethefollowing:

Disperseacluster:Thisisaccomplishedbyremovingthecentroidthatcorrespondstotheclusterandreassigningthepointstootherclusters.Ideally,theclusterthatisdispersedshouldbetheonethatincreasesthetotalSSEtheleast.

Mergetwoclusters:Theclusterswiththeclosestcentroidsaretypicallychosen,althoughanother,perhapsbetter,approachistomergethetwoclustersthatresultinthesmallestincreaseintotalSSE.ThesetwomergingstrategiesarethesameonesthatareusedinthehierarchicalclusteringtechniquesknownasthecentroidmethodandWard’smethod,respectively.BothmethodsarediscussedinSection7.3 .

UpdatingCentroidsIncrementallyInsteadofupdatingclustercentroidsafterallpointshavebeenassignedtoacluster,thecentroidscanbeupdatedincrementally,aftereachassignmentofapointtoacluster.Noticethatthisrequireseitherzeroortwoupdatestoclustercentroidsateachstep,sinceapointeithermovestoanewcluster(twoupdates)orstaysinitscurrentcluster(zeroupdates).Usinganincrementalupdatestrategyguaranteesthatemptyclustersarenotproducedbecauseallclustersstartwithasinglepoint,andifaclustereverhasonlyonepoint,thenthatpointwillalwaysbereassignedtothesamecluster.
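The zero-or-two update bookkeeping described above can be made concrete with a small sketch (ours; the name move_point and the one-dimensional toy values are illustrative). By maintaining per-cluster sums and counts, each reassignment updates exactly the two affected centroids in constant time.

def move_point(point, old, new, sums, counts, centroids):
    """Reassign a single 1-D point from cluster `old` to cluster `new`,
    updating exactly two centroids; if old == new, nothing changes."""
    if old == new:
        return                              # zero updates
    for c, sign in ((old, -1), (new, +1)):  # two updates
        sums[c] += sign * point
        counts[c] += sign
        centroids[c] = sums[c] / counts[c]

# Tiny example: two clusters whose running sums/counts give centroids 1.0 and 10.0.
sums, counts = [3.0, 30.0], [3, 3]
centroids = [s / n for s, n in zip(sums, counts)]
move_point(4.0, old=1, new=0, sums=sums, counts=counts, centroids=centroids)
print(centroids)   # centroid 0 rises to 1.75, centroid 1 rises to 13.0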

Inaddition,ifincrementalupdatingisused,therelativeweightofthepointbeingaddedcanbeadjusted;e.g.,theweightofpointsisoftendecreasedastheclusteringproceeds.Whilethiscanresultinbetteraccuracyandfasterconvergence,itcanbedifficulttomakeagoodchoicefortherelativeweight,especiallyinawidevarietyofsituations.Theseupdateissuesaresimilartothoseinvolvedinupdatingweightsforartificialneuralnetworks.

Yetanotherbenefitofincrementalupdateshastodowithusingobjectivesotherthan“minimizeSSE.”Supposethatwearegivenanarbitraryobjectivefunctiontomeasurethegoodnessofasetofclusters.Whenweprocessanindividualpoint,wecancomputethevalueoftheobjectivefunctionforeachpossibleclusterassignment,andthenchoosetheonethatoptimizestheobjective.SpecificexamplesofalternativeobjectivefunctionsaregiveninSection7.5.2 .

Onthenegativeside,updatingcentroidsincrementallyintroducesanorderdependency.Inotherwords,theclustersproducedusuallydependontheorderinwhichthepointsareprocessed.Althoughthiscanbeaddressedbyrandomizingtheorderinwhichthepointsareprocessed,thebasicK-meansapproachofupdatingthecentroidsafterallpointshavebeenassignedtoclustershasnoorderdependency.Also,incrementalupdatesareslightlymoreexpensive.However,K-meansconvergesratherquickly,andtherefore,thenumberofpointsswitchingclustersquicklybecomesrelativelysmall.

7.2.3BisectingK-means

ThebisectingK-meansalgorithmisastraightforwardextensionofthebasicK-meansalgorithmthatisbasedonasimpleidea:toobtainKclusters,splitthesetofallpointsintotwoclusters,selectoneoftheseclusterstosplit,andsoon,untilKclustershavebeenproduced.ThedetailsofbisectingK-meansaregivenbyAlgorithm7.3.

Thereareanumberofdifferentwaystochoosewhichclustertosplit.Wecanchoosethelargestclusterateachstep,choosetheonewiththelargestSSE,oruseacriterionbasedonbothsizeandSSE.Differentchoicesresultindifferentclusters.

BecauseweareusingtheK-meansalgorithm“locally,”i.e.,tobisectindividualclusters,thefinalsetofclustersdoesnotrepresentaclusteringthatisalocalminimumwithrespecttothetotalSSE.Thus,weoftenrefinetheresultingclustersbyusingtheirclustercentroidsastheinitialcentroidsforthestandardK-meansalgorithm.

Algorithm 7.3 Bisecting K-means algorithm.

Example7.3(BisectingK-meansandInitialization).ToillustratethatbisectingK-meansislesssusceptibletoinitializationproblems,weshow,inFigure7.8 ,howbisectingK-meansfindsfourclustersinthedatasetoriginallyshowninFigure7.6(a) .Initeration1,twopairsofclustersarefound;initeration2,therightmostpairofclusters

1: Initialize the list of clusters to contain the cluster consisting of all points.
2: repeat
3:   Remove a cluster from the list of clusters.
4:   {Perform several "trial" bisections of the chosen cluster.}
5:   for i = 1 to number of trials do
6:     Bisect the selected cluster using basic K-means.
7:   end for
8:   Select the two clusters from the bisection with the lowest total SSE.
9:   Add these two clusters to the list of clusters.
10: until The list of clusters contains K clusters.

issplit;andiniteration3,theleftmostpairofclustersissplit.BisectingK-meanshaslesstroublewithinitializationbecauseitperformsseveraltrialbisectionsandtakestheonewiththelowestSSE,andbecausethereareonlytwocentroidsateachstep.

Figure7.8.BisectingK-meansonthefourclustersexample.

Finally,byrecordingthesequenceofclusteringsproducedasK-meansbisectsclusters,wecanalsousebisectingK-meanstoproduceahierarchicalclustering.
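A minimal sketch of bisecting K-means follows, assuming a simple Lloyd-style 2-means as the "trial" bisection and splitting the largest cluster at each step (as noted above, the cluster with the largest SSE, or a combination of size and SSE, could be chosen instead). The function names two_means and bisecting_kmeans and the toy data are ours.

import math
import random

def two_means(points, seed, iters=10):
    """One 'trial' bisection: basic K-means with K = 2 (Lloyd's algorithm)."""
    rng = random.Random(seed)
    cents = rng.sample(points, 2)
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            groups[0 if math.dist(p, cents[0]) <= math.dist(p, cents[1]) else 1].append(p)
        cents = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
                 for g, c in zip(groups, cents)]
    sse = sum(min(math.dist(p, c) ** 2 for c in cents) for p in points)
    return groups, sse

def bisecting_kmeans(points, k, trials=5):
    """Sketch of Algorithm 7.3: repeatedly bisect a chosen cluster."""
    clusters = [points]
    while len(clusters) < k:
        target = max(clusters, key=len)          # here: split the largest cluster
        clusters.remove(target)
        best = min((two_means(target, seed=s) for s in range(trials)),
                   key=lambda gs: gs[1])         # keep the bisection with lowest SSE
        clusters.extend(g for g in best[0] if g)
    return clusters

random.seed(3)
data = [(random.gauss(mx, 0.2), random.gauss(my, 0.2))
        for mx, my in [(0, 0), (0, 2), (4, 0), (4, 2)] for _ in range(30)]
for c in bisecting_kmeans(data, k=4):
    print(len(c))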

7.2.4K-meansandDifferentTypesofClusters

K-meansanditsvariationshaveanumberoflimitationswithrespecttofindingdifferenttypesofclusters.Inparticular,K-meanshasdifficultydetectingthe“natural”clusters,whenclustershavenon-sphericalshapesorwidelydifferentsizesordensities.ThisisillustratedbyFigures7.9 ,7.10 ,and7.11 .InFigure7.9 ,K-meanscannotfindthethreenaturalclustersbecauseoneoftheclustersismuchlargerthantheothertwo,andhence,thelargerclusteris

broken,whileoneofthesmallerclustersiscombinedwithaportionofthelargercluster.InFigure7.10 ,K-meansfailstofindthethreenaturalclustersbecausethetwosmallerclustersaremuchdenserthanthelargercluster.Finally,inFigure7.11 ,K-meansfindstwoclustersthatmixportionsofthetwonaturalclustersbecausetheshapeofthenaturalclustersisnotglobular.

Figure7.9.K-meanswithclustersofdifferentsize.

Figure7.10.K-meanswithclustersofdifferentdensity.

Figure7.11.K-meanswithnon-globularclusters.

ThedifficultyinthesethreesituationsisthattheK-meansobjectivefunctionisamismatchforthekindsofclusterswearetryingtofindbecauseitisminimizedbyglobularclustersofequalsizeanddensityorbyclustersthatarewellseparated.However,theselimitationscanbeovercome,insomesense,iftheuseriswillingtoacceptaclusteringthatbreaksthenaturalclustersintoanumberofsubclusters.Figure7.12 showswhathappenstothethreepreviousdatasetsifwefindsixclustersinsteadoftwoorthree.Eachsmallerclusterispureinthesensethatitcontainsonlypointsfromoneofthenaturalclusters.

Figure7.12.UsingK-meanstofindclustersthataresubclustersofthenaturalclusters.

7.2.5StrengthsandWeaknesses

K-meansissimpleandcanbeusedforawidevarietyofdatatypes.Itisalsoquiteefficient,eventhoughmultiplerunsareoftenperformed.Somevariants,includingbisectingK-means,areevenmoreefficient,andarelesssusceptibletoinitializationproblems.K-meansisnotsuitableforalltypesofdata,however.Itcannothandlenon-globularclustersorclustersofdifferentsizesanddensities,althoughitcantypicallyfindpuresubclustersifalargeenoughnumberofclustersisspecified.K-meansalsohastroubleclusteringdatathatcontainsoutliers.Outlierdetectionandremovalcanhelpsignificantlyinsuchsituations.Finally,K-meansisrestrictedtodataforwhichthereisanotionofacenter(centroid).Arelatedtechnique,K-medoidclustering,doesnothavethisrestriction,butismoreexpensive.

7.2.6K-meansasanOptimizationProblem

Here,wedelveintothemathematicsbehindK-means.Thissection,whichcanbeskippedwithoutlossofcontinuity,requiresknowledgeofcalculusthroughpartialderivatives.Familiaritywithoptimizationtechniques,especiallythosebasedongradientdescent,canalsobehelpful.

Asmentionedearlier,givenanobjectivefunctionsuchas“minimizeSSE,”clusteringcanbetreatedasanoptimizationproblem.Onewaytosolvethisproblem—tofindaglobaloptimum—istoenumerateallpossiblewaysofdividingthepointsintoclustersandthenchoosethesetofclustersthatbestsatisfiestheobjectivefunction,e.g.,thatminimizesthetotalSSE.Ofcourse,thisexhaustivestrategyiscomputationallyinfeasibleandasaresult,amorepracticalapproachisneeded,evenifsuchanapproachfindssolutionsthatarenotguaranteedtobeoptimal.Onetechnique,whichisknownasgradientdescent,isbasedonpickinganinitialsolutionandthenrepeatingthefollowingtwosteps:computethechangetothesolutionthatbestoptimizestheobjectivefunctionandthenupdatethesolution.

We assume that the data is one-dimensional, i.e., dist(x, y) = (x - y)^2. This does not change anything essential, but greatly simplifies the notation.

Derivation of K-means as an Algorithm to Minimize the SSE In this section, we show how the centroid for the K-means algorithm can be mathematically derived when the proximity function is Euclidean distance and the objective is to minimize the SSE. Specifically, we investigate how we can best update a cluster centroid so that the cluster SSE is minimized. In mathematical terms, we seek to minimize Equation 7.1, which we repeat here, specialized for one-dimensional data.

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} (c_i - x)^2    (7.4)

Here, C_i is the ith cluster, x is a point in C_i, and c_i is the mean of the ith cluster. See Table 7.1 for a complete list of notation.

We can solve for the kth centroid c_k, which minimizes Equation 7.4, by differentiating the SSE, setting it equal to 0, and solving, as indicated below.

\frac{\partial}{\partial c_k} SSE = \frac{\partial}{\partial c_k} \sum_{i=1}^{K} \sum_{x \in C_i} (c_i - x)^2
  = \sum_{i=1}^{K} \sum_{x \in C_i} \frac{\partial}{\partial c_k} (c_i - x)^2
  = \sum_{x \in C_k} 2 (c_k - x_k) = 0

\sum_{x \in C_k} 2 (c_k - x_k) = 0 \Rightarrow m_k c_k = \sum_{x \in C_k} x_k \Rightarrow c_k = \frac{1}{m_k} \sum_{x \in C_k} x_k

Thus, as previously indicated, the best centroid for minimizing the SSE of a cluster is the mean of the points in the cluster.

Derivation of K-means for SAE To demonstrate that the K-means algorithm can be applied to a variety of different objective functions, we consider how to partition the data into K clusters such that the sum of the Manhattan (L1) distances of points from the center of their clusters is minimized. We are seeking to minimize the sum of the L1 absolute errors (SAE) as given by the following equation, where dist_{L1} is the L1 distance. Again, for notational simplicity, we use one-dimensional data, i.e., dist_{L1} = |c_i - x|.

SAE = \sum_{i=1}^{K} \sum_{x \in C_i} dist_{L1}(c_i, x)    (7.5)

We can solve for the kth centroid c_k, which minimizes Equation 7.5, by differentiating the SAE, setting it equal to 0, and solving.

\frac{\partial}{\partial c_k} SAE = \frac{\partial}{\partial c_k} \sum_{i=1}^{K} \sum_{x \in C_i} |c_i - x|
  = \sum_{i=1}^{K} \sum_{x \in C_i} \frac{\partial}{\partial c_k} |c_i - x|
  = \sum_{x \in C_k} \frac{\partial}{\partial c_k} |c_k - x| = 0

\sum_{x \in C_k} \frac{\partial}{\partial c_k} |c_k - x| = 0 \Rightarrow \sum_{x \in C_k} sign(x - c_k) = 0

If we solve for c_k, we find that c_k = median\{x \in C_k\}, the median of the points in the cluster. The median of a group of points is straightforward to compute and less susceptible to distortion by outliers.
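A quick numerical check of these two results (our own toy example, not from the text): over a grid of candidate centers, the SSE is minimized at the mean of the cluster, while the SAE is minimized at the median, which is far less affected by the outlier.

from statistics import mean, median

cluster = [1.0, 2.0, 3.0, 4.0, 100.0]   # one-dimensional cluster with an outlier

def sse(c):  # sum of squared errors for center c
    return sum((x - c) ** 2 for x in cluster)

def sae(c):  # sum of absolute (L1) errors for center c
    return sum(abs(x - c) for x in cluster)

grid = [c / 10 for c in range(0, 1001)]          # candidate centers 0.0, 0.1, ..., 100.0
print(min(grid, key=sse), mean(cluster))         # both 22.0
print(min(grid, key=sae), median(cluster))       # both 3.0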

7.3AgglomerativeHierarchicalClusteringHierarchicalclusteringtechniquesareasecondimportantcategoryofclusteringmethods.AswithK-means,theseapproachesarerelativelyoldcomparedtomanyclusteringalgorithms,buttheystillenjoywidespreaduse.Therearetwobasicapproachesforgeneratingahierarchicalclustering:

Agglomerative:Startwiththepointsasindividualclustersand,ateachstep,mergetheclosestpairofclusters.Thisrequiresdefininganotionofclusterproximity.

Divisive:Startwithone,all-inclusiveclusterand,ateachstep,splitaclusteruntilonlysingletonclustersofindividualpointsremain.Inthiscase,weneedtodecidewhichclustertosplitateachstepandhowtodothesplitting.

Agglomerativehierarchicalclusteringtechniquesarebyfarthemostcommon,and,inthissection,wewillfocusexclusivelyonthesemethods.AdivisivehierarchicalclusteringtechniqueisdescribedinSection8.4.2 .

Ahierarchicalclusteringisoftendisplayedgraphicallyusingatree-likediagramcalledadendrogram,whichdisplaysboththecluster-subclusterrelationshipsandtheorderinwhichtheclustersweremerged(agglomerativeview)orsplit(divisiveview).Forsetsoftwo-dimensionalpoints,suchasthosethatwewilluseasexamples,ahierarchicalclusteringcanalsobegraphicallyrepresentedusinganestedclusterdiagram.Figure7.13 showsanexampleofthesetwotypesoffiguresforasetoffourtwo-dimensionalpoints.

Thesepointswereclusteredusingthesingle-linktechniquethatisdescribedinSection7.3.2 .

Figure7.13.Ahierarchicalclusteringoffourpointsshownasadendrogramandasnestedclusters.

7.3.1BasicAgglomerativeHierarchicalClusteringAlgorithm

Manyagglomerativehierarchicalclusteringtechniquesarevariationsonasingleapproach:startingwithindividualpointsasclusters,successivelymergethetwoclosestclustersuntilonlyoneclusterremains.ThisapproachisexpressedmoreformallyinAlgorithm7.4 .

Algorithm 7.4 Basic agglomerative hierarchical clustering algorithm.

DefiningProximitybetweenClustersThekeyoperationofAlgorithm7.4 isthecomputationoftheproximitybetweentwoclusters,anditisthedefinitionofclusterproximitythatdifferentiatesthevariousagglomerativehierarchicaltechniquesthatwewilldiscuss.Clusterproximityistypicallydefinedwithaparticulartypeofclusterinmind—seeSection7.1.3 .Forexample,manyagglomerativehierarchicalclusteringtechniques,suchasMIN,MAX,andGroupAverage,comefromagraph-basedviewofclusters.MINdefinesclusterproximityastheproximitybetweentheclosesttwopointsthatareindifferentclusters,orusinggraphterms,theshortestedgebetweentwonodesindifferentsubsetsofnodes.Thisyieldscontiguity-basedclustersasshowninFigure7.2(c) .Alternatively,MAXtakestheproximitybetweenthefarthesttwopointsindifferentclusterstobetheclusterproximity,orusinggraphterms,thelongestedgebetweentwonodesindifferentsubsetsofnodes.(Ifourproximitiesaredistances,thenthenames,MINandMAX,areshortandsuggestive.Forsimilarities,however,wherehighervaluesindicatecloserpoints,thenamesseemreversed.Forthatreason,weusuallyprefertousethealternativenames,singlelinkandcompletelink,respectively.)Anothergraph-basedapproach,thegroupaveragetechnique,definesclusterproximitytobetheaveragepairwiseproximities(averagelengthofedges)ofallpairsofpointsfromdifferentclusters.Figure7.14 illustratesthesethreeapproaches.

1: Compute the proximity matrix, if necessary.
2: repeat
3:   Merge the closest two clusters.
4:   Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
5: until Only one cluster remains.

Figure7.14.Graph-baseddefinitionsofclusterproximity.

If,instead,wetakeaprototype-basedview,inwhicheachclusterisrepresentedbyacentroid,differentdefinitionsofclusterproximityaremorenatural.Whenusingcentroids,theclusterproximityiscommonlydefinedastheproximitybetweenclustercentroids.Analternativetechnique,Ward’smethod,alsoassumesthataclusterisrepresentedbyitscentroid,butitmeasurestheproximitybetweentwoclustersintermsoftheincreaseintheSSEthatresultsfrommergingthetwoclusters.LikeK-means,Ward’smethodattemptstominimizethesumofthesquareddistancesofpointsfromtheirclustercentroids.

Time and Space Complexity The basic agglomerative hierarchical clustering algorithm just presented uses a proximity matrix. This requires the storage of (1/2) m^2 proximities (assuming the proximity matrix is symmetric), where m is the number of data points. The space needed to keep track of the clusters is proportional to the number of clusters, which is m - 1, excluding singleton clusters. Hence, the total space complexity is O(m^2).

The analysis of the basic agglomerative hierarchical clustering algorithm is also straightforward with respect to computational complexity. O(m^2) time is required to compute the proximity matrix. After that step, there are m - 1 iterations involving Steps 3 and 4 because there are m clusters at the start and two clusters are merged during each iteration. If performed as a linear search of the proximity matrix, then for the ith iteration, Step 3 requires O((m - i + 1)^2) time, which is proportional to the current number of clusters squared. Step 4 requires O(m - i + 1) time to update the proximity matrix after the merger of two clusters. (A cluster merger affects O(m - i + 1) proximities for the techniques that we consider.) Without modification, this would yield a time complexity of O(m^3). If the distances from each cluster to all other clusters are stored as a sorted list (or heap), it is possible to reduce the cost of finding the two closest clusters to O(m - i + 1). However, because of the additional complexity of keeping data in a sorted list or heap, the overall time required for a hierarchical clustering based on Algorithm 7.4 is O(m^2 log m).

The space and time complexity of hierarchical clustering severely limits the size of data sets that can be processed. We discuss scalability approaches for clustering algorithms, including hierarchical clustering techniques, in Section 8.5. Note, however, that the bisecting K-means algorithm presented in Section 7.2.3 is a scalable algorithm that produces a hierarchical clustering.

7.3.2 Specific Techniques

Sample Data To illustrate the behavior of the various hierarchical clustering algorithms, we will use sample data that consists of six two-dimensional points, which are shown in Figure 7.15. The x and y coordinates of the points and the Euclidean distances between them are shown in Tables 7.3 and 7.4, respectively.

Figure7.15.Setofsixtwo-dimensionalpoints.

Table7.3.xy-coordinatesofsixpoints.

Point xCoordinate yCoordinate

p1 0.4005 0.5306

p2 0.2148 0.3854

p3 0.3457 0.3156

p4 0.2652 0.1875

p5 0.0789 0.4139

p6 0.4548 0.3022

Table7.4.Euclideandistancematrixforsixpoints.

p1 p2 p3 p4 p5 p6

p1 0.00 0.24 0.22 0.37 0.34 0.23

p2 0.24 0.00 0.15 0.20 0.14 0.25

p3 0.22 0.15 0.00 0.15 0.28 0.11

p4 0.37 0.20 0.15 0.00 0.29 0.22

p5 0.34 0.14 0.28 0.29 0.00 0.39

p6 0.23 0.25 0.11 0.22 0.39 0.00
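Table 7.4 can be reproduced directly from the coordinates in Table 7.3. The short sketch below (ours) computes the pairwise Euclidean distances and rounds them to two decimals, matching the entries of the table.

import math

points = {                      # coordinates from Table 7.3
    "p1": (0.4005, 0.5306), "p2": (0.2148, 0.3854), "p3": (0.3457, 0.3156),
    "p4": (0.2652, 0.1875), "p5": (0.0789, 0.4139), "p6": (0.4548, 0.3022),
}

names = sorted(points)
for a in names:
    row = [round(math.dist(points[a], points[b]), 2) for b in names]
    print(a, row)
# e.g., the p3-p6 entry is 0.11, the smallest off-diagonal distance in Table 7.4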

SingleLinkorMINForthesinglelinkorMINversionofhierarchicalclustering,theproximityoftwoclustersisdefinedastheminimumofthedistance(maximumofthesimilarity)betweenanytwopointsinthetwodifferentclusters.Usinggraphterminology,ifyoustartwithallpointsassingletonclustersandaddlinksbetweenpointsoneatatime,shortestlinksfirst,thenthesesinglelinkscombinethepointsintoclusters.Thesinglelinktechniqueisgoodathandlingnon-ellipticalshapes,butissensitivetonoiseandoutliers.

Example 7.4 (Single Link). Figure 7.16 shows the result of applying the single link technique to our example data set of six points. Figure 7.16(a) shows the nested clusters as a sequence of nested ellipses, where the numbers associated with the ellipses indicate the order of the clustering. Figure 7.16(b) shows the same information, but as a dendrogram. The height at which two clusters are merged in the dendrogram reflects the distance of the two clusters. For instance, from Table 7.4, we see that the distance between points 3 and 6 is 0.11, and that is the height at which they are joined into one cluster in the dendrogram. As another example, the distance between clusters {3, 6} and {2, 5} is given by

dist({3,6}, {2,5}) = min(dist(3,2), dist(6,2), dist(3,5), dist(6,5)) = min(0.15, 0.25, 0.28, 0.39) = 0.15.

Figure7.16.SinglelinkclusteringofthesixpointsshowninFigure7.15 .

CompleteLinkorMAXorCLIQUEForthecompletelinkorMAXversionofhierarchicalclustering,theproximityoftwoclustersisdefinedasthemaximumofthedistance(minimumofthesimilarity)betweenanytwopointsinthetwodifferentclusters.Usinggraphterminology,ifyoustartwithallpointsassingletonclustersandaddlinksbetweenpointsoneatatime,shortestlinksfirst,thenagroupofpointsisnotaclusteruntilallthepointsinitarecompletelylinked,i.e.,formaclique.Completelinkislesssusceptibletonoiseandoutliers,butitcanbreaklargeclustersanditfavorsglobularshapes.

Example 7.5 (Complete Link). Figure 7.17 shows the results of applying MAX to the sample data set of six points. As with single link, points 3 and 6 are merged first. However, {3, 6} is merged with {4}, instead of {2, 5} or {1}, because

dist({3,6}, {4}) = max(dist(3,4), dist(6,4)) = max(0.15, 0.22) = 0.22.
dist({3,6}, {2,5}) = max(dist(3,2), dist(6,2), dist(3,5), dist(6,5)) = max(0.15, 0.25, 0.28, 0.39) = 0.39.
dist({3,6}, {1}) = max(dist(3,1), dist(6,1)) = max(0.22, 0.23) = 0.23.

Figure7.17.CompletelinkclusteringofthesixpointsshowninFigure7.15 .

Group Average For the group average version of hierarchical clustering, the proximity of two clusters is defined as the average pairwise proximity among all pairs of points in the different clusters. This is an intermediate approach between the single and complete link approaches. Thus, for group average, the cluster proximity proximity(C_i, C_j) of clusters C_i and C_j, which are of size m_i and m_j, respectively, is expressed by the following equation:

proximity(C_i, C_j) = \frac{\sum_{x \in C_i, y \in C_j} proximity(x, y)}{m_i \times m_j}.    (7.6)

Example 7.6 (Group Average). Figure 7.18 shows the results of applying the group average approach to the sample data set of six points. To illustrate how group average works, we calculate the distance between some clusters.

dist({3,6,4}, {1}) = (0.22 + 0.37 + 0.23)/(3 × 1) = 0.28
dist({2,5}, {1}) = (0.24 + 0.34)/(2 × 1) = 0.29
dist({3,6,4}, {2,5}) = (0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29)/(3 × 2) = 0.26

Because dist({3,6,4}, {2,5}) is smaller than dist({3,6,4}, {1}) and dist({2,5}, {1}), clusters {3,6,4} and {2,5} are merged at the fourth stage.

Figure 7.18. Group average clustering of the six points shown in Figure 7.15.
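The group average distances in Example 7.6 can be checked mechanically from Table 7.4. The sketch below (ours) averages all pairwise distances between two clusters, as in Equation 7.6; because it uses the rounded entries of Table 7.4, the second value comes out as 0.27 rather than the 0.28 obtained from the unrounded coordinates.

dist = {  # off-diagonal entries of Table 7.4 (symmetric)
    (1, 2): 0.24, (1, 3): 0.22, (1, 4): 0.37, (1, 5): 0.34, (1, 6): 0.23,
    (2, 3): 0.15, (2, 4): 0.20, (2, 5): 0.14, (2, 6): 0.25,
    (3, 4): 0.15, (3, 5): 0.28, (3, 6): 0.11,
    (4, 5): 0.29, (4, 6): 0.22, (5, 6): 0.39,
}

def d(a, b):
    return dist[(min(a, b), max(a, b))]

def group_average(ci, cj):
    """Equation 7.6: average pairwise distance between clusters ci and cj."""
    return sum(d(x, y) for x in ci for y in cj) / (len(ci) * len(cj))

print(round(group_average({3, 6, 4}, {2, 5}), 2))  # 0.26
print(round(group_average({3, 6, 4}, {1}), 2))     # 0.27 (0.28 in the text, from unrounded distances)
print(round(group_average({2, 5}, {1}), 2))        # 0.29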

Ward’sMethodandCentroidMethodsForWard’smethod,theproximitybetweentwoclustersisdefinedastheincreaseinthesquarederrorthatresultswhentwoclustersaremerged.Thus,thismethodusesthesameobjectivefunctionasK-meansclustering.WhileitmightseemthatthisfeaturemakesWard’smethodsomewhatdistinctfromotherhierarchicaltechniques,itcanbeshownmathematicallythatWard’smethodisverysimilartothegroupaveragemethodwhentheproximitybetweentwopointsistakentobethesquareofthedistancebetweenthem.

Example7.7(Ward’sMethod).Figure7.19 showstheresultsofapplyingWard’smethodtothesampledatasetofsixpoints.Theclusteringthatisproducedisdifferentfromthoseproducedbysinglelink,completelink,andgroupaverage.


Figure7.19.Ward’sclusteringofthesixpointsshowninFigure7.15 .

Centroidmethodscalculatetheproximitybetweentwoclustersbycalculatingthedistancebetweenthecentroidsofclusters.ThesetechniquesmayseemsimilartoK-means,butaswehaveremarked,Ward’smethodisthecorrecthierarchicalanalog.

Centroidmethodsalsohaveacharacteristic—oftenconsideredbad—thatisnotpossessedbytheotherhierarchicalclusteringtechniquesthatwehavediscussed:thepossibilityofinversions.Specifically,twoclustersthataremergedcanbemoresimilar(lessdistant)thanthepairofclustersthatweremergedinapreviousstep.Fortheothermethods,thedistancebetweenmergedclustersmonotonicallyincreases(oris,atworst,non-increasing)asweproceedfromsingletonclusterstooneall-inclusivecluster.

7.3.3 The Lance-Williams Formula for Cluster Proximity

Any of the cluster proximities that we have discussed in this section can be viewed as a choice of different parameters (in the Lance-Williams formula shown below in Equation 7.7) for the proximity between clusters Q and R, where R is formed by merging clusters A and B. In this equation, p(.,.) is a proximity function, while m_A, m_B, and m_Q are the number of points in clusters A, B, and Q, respectively. In other words, after we merge clusters A and B to form cluster R, the proximity of the new cluster, R, to an existing cluster, Q, is a linear function of the proximities of Q with respect to the original clusters A and B. Table 7.5 shows the values of these coefficients for the techniques that we have discussed.

p(R, Q) = \alpha_A p(A, Q) + \alpha_B p(B, Q) + \beta p(A, B) + \gamma |p(A, Q) - p(B, Q)|    (7.7)

Table 7.5. Table of Lance-Williams coefficients for common hierarchical clustering approaches.

Clustering Method | \alpha_A | \alpha_B | \beta | \gamma
Single Link | 1/2 | 1/2 | 0 | -1/2
Complete Link | 1/2 | 1/2 | 0 | 1/2
Group Average | m_A/(m_A + m_B) | m_B/(m_A + m_B) | 0 | 0
Centroid | m_A/(m_A + m_B) | m_B/(m_A + m_B) | -m_A m_B/(m_A + m_B)^2 | 0
Ward's | (m_A + m_Q)/(m_A + m_B + m_Q) | (m_B + m_Q)/(m_A + m_B + m_Q) | -m_Q/(m_A + m_B + m_Q) | 0

Any hierarchical clustering technique that can be expressed using the Lance-Williams formula does not need to keep the original data points. Instead, the proximity matrix is updated as clustering occurs. While a general formula is appealing, especially for implementation, it is easier to understand the different hierarchical methods by looking directly at the definition of cluster proximity that each method uses.
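To see how Equation 7.7 is used, here is a sketch (ours) of a single merge-step update with Lance-Williams coefficients; the coefficient table encodes the single link, complete link, and group average rows of Table 7.5, and the example numbers come from the six-point data set of Table 7.4.

def lance_williams_update(p_aq, p_bq, p_ab, m_a, m_b, m_q, method="single"):
    """Equation 7.7: proximity of the merged cluster R = A u B to an existing cluster Q."""
    coeffs = {
        # method: (alpha_A, alpha_B, beta, gamma) as in Table 7.5
        "single":   (0.5, 0.5, 0.0, -0.5),
        "complete": (0.5, 0.5, 0.0, +0.5),
        "average":  (m_a / (m_a + m_b), m_b / (m_a + m_b), 0.0, 0.0),
    }
    a_a, a_b, beta, gamma = coeffs[method]
    return a_a * p_aq + a_b * p_bq + beta * p_ab + gamma * abs(p_aq - p_bq)

# Merging A = {3,6} and B = {2,5} and computing the distance of R to Q = {1}:
# single link inputs are the single-link distances, complete link inputs the complete-link ones.
print(round(lance_williams_update(0.22, 0.24, 0.15, 2, 2, 1, "single"), 2))    # 0.22 = min(0.22, 0.24)
print(round(lance_williams_update(0.23, 0.34, 0.39, 2, 2, 1, "complete"), 2))  # 0.34 = max(0.23, 0.34)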

7.3.4 Key Issues in Hierarchical Clustering

Lack of a Global Objective Function We previously mentioned that agglomerative hierarchical clustering cannot be viewed as globally optimizing an objective function. Instead, agglomerative hierarchical clustering techniques use various criteria to decide locally, at each step, which clusters should be merged (or split for divisive approaches). This approach yields clustering algorithms that avoid the difficulty of attempting to solve a hard combinatorial optimization problem. (It can be shown that the general clustering problem for an objective function such as "minimize SSE" is computationally infeasible.) Furthermore, such approaches do not have difficulties in choosing initial points. Nonetheless, the time complexity of O(m^2 log m) and the space complexity of O(m^2) are prohibitive in many cases.

Ability to Handle Different Cluster Sizes One aspect of agglomerative hierarchical clustering that we have not yet discussed is how to treat the relative sizes of the pairs of clusters that are merged. (This discussion applies only to cluster proximity schemes that involve sums, such as centroid, Ward's, and group average.) There are two approaches: weighted, which treats all clusters equally, and unweighted, which takes the number of points in each cluster into account. Note that the terminology of weighted or unweighted refers to the data points, not the clusters. In other words, treating clusters of unequal size equally (the weighted approach) gives different weights to the points in different clusters, while taking the cluster size into account (the unweighted approach) gives points in different clusters the same weight.

We will illustrate this using the group average technique discussed in Section 7.3.2, which is the unweighted version of the group average technique. In the clustering literature, the full name of this approach is the Unweighted Pair Group Method using Arithmetic averages (UPGMA). In Table 7.5, which gives the formula for updating cluster similarity, the coefficients for UPGMA involve the size, m_A and m_B, of each of the clusters, A and B, that were merged: \alpha_A = m_A/(m_A + m_B), \alpha_B = m_B/(m_A + m_B), \beta = 0, \gamma = 0. For the weighted version of group average, known as WPGMA, the coefficients are constants that are independent of the cluster sizes: \alpha_A = 1/2, \alpha_B = 1/2, \beta = 0, \gamma = 0. In general, unweighted approaches are preferred unless there is reason to believe that individual points should have different weights; e.g., perhaps classes of objects have been unevenly sampled.

MergingDecisionsAreFinalAgglomerativehierarchicalclusteringalgorithmstendtomakegoodlocaldecisionsaboutcombiningtwoclustersbecausetheycanuseinformationaboutthepairwisesimilarityofallpoints.However,onceadecisionismadetomergetwoclusters,itcannotbeundoneatalatertime.Thisapproachpreventsalocaloptimizationcriterionfrombecomingaglobaloptimizationcriterion.Forexample,althoughthe“minimizesquarederror”criterionfromK-meansisusedindecidingwhichclusterstomergeinWard’smethod,theclustersateachleveldonotrepresentlocalminimawithrespecttothetotalSSE.Indeed,theclustersarenotevenstable,inthesensethatapointinoneclustercanbeclosertothecentroidofsomeotherclusterthanitistothecentroidofitscurrentcluster.Nonetheless,Ward’smethodisoftenusedasarobustmethodofinitializingaK-meansclustering,indicatingthatalocal“minimizesquarederror”objectivefunctiondoeshaveaconnectiontoaglobal“minimizesquarederror”objectivefunction.


Therearesometechniquesthatattempttoovercomethelimitationthatmergesarefinal.Oneapproachattemptstofixupthehierarchicalclusteringbymovingbranchesofthetreearoundsoastoimproveaglobalobjectivefunction.AnotherapproachusesapartitionalclusteringtechniquesuchasK-meanstocreatemanysmallclusters,andthenperformshierarchicalclusteringusingthesesmallclustersasthestartingpoint.

7.3.5Outliers

OutliersposethemostseriousproblemsforWard’smethodandcentroid-basedhierarchicalclusteringapproachesbecausetheyincreaseSSEanddistortcentroids.Forclusteringapproaches,suchassinglelink,completelink,andgroupaverage,outliersarepotentiallylessproblematic.Ashierarchicalclusteringproceedsforthesealgorithms,outliersorsmallgroupsofoutlierstendtoformsingletonorsmallclustersthatdonotmergewithanyotherclustersuntilmuchlaterinthemergingprocess.Bydiscardingsingletonorsmallclustersthatarenotmergingwithotherclusters,outlierscanberemoved.

7.3.6StrengthsandWeaknesses

Thestrengthsandweaknessesofspecificagglomerativehierarchicalclusteringalgorithmswerediscussedabove.Moregenerally,suchalgorithmsaretypicallyusedbecausetheunderlyingapplication,e.g.,creationofataxonomy,requiresahierarchy.Also,somestudieshavesuggestedthatthesealgorithmscanproducebetter-qualityclusters.However,agglomerativehierarchicalclusteringalgorithmsareexpensiveintermsoftheircomputationalandstoragerequirements.Thefactthatallmergesarefinalcan

alsocausetroublefornoisy,high-dimensionaldata,suchasdocumentdata.Inturn,thesetwoproblemscanbeaddressedtosomedegreebyfirstpartiallyclusteringthedatausinganothertechnique,suchasK-means.

7.4DBSCANDensity-basedclusteringlocatesregionsofhighdensitythatareseparatedfromoneanotherbyregionsoflowdensity.DBSCANisasimpleandeffectivedensity-basedclusteringalgorithmthatillustratesanumberofimportantconceptsthatareimportantforanydensity-basedclusteringapproach.Inthissection,wefocussolelyonDBSCANafterfirstconsideringthekeynotionofdensity.Otheralgorithmsforfindingdensity-basedclustersaredescribedinthenextchapter.

7.4.1TraditionalDensity:Center-BasedApproach

Althoughtherearenotasmanyapproachesfordefiningdensityastherearefordefiningsimilarity,thereareseveraldistinctmethods.Inthissectionwediscussthecenter-basedapproachonwhichDBSCANisbased.OtherdefinitionsofdensitywillbepresentedinChapter8 .

Inthecenter-basedapproach,densityisestimatedforaparticularpointinthedatasetbycountingthenumberofpointswithinaspecifiedradius,Eps,ofthatpoint.Thisincludesthepointitself.ThistechniqueisgraphicallyillustratedbyFigure7.20 .ThenumberofpointswithinaradiusofEpsofpointAis7,includingAitself.

Figure7.20.Center-baseddensity.

Thismethodissimpletoimplement,butthedensityofanypointwilldependonthespecifiedradius.Forinstance,iftheradiusislargeenough,thenallpointswillhaveadensityofm,thenumberofpointsinthedataset.Likewise,iftheradiusistoosmall,thenallpointswillhaveadensityof1.Anapproachfordecidingontheappropriateradiusforlow-dimensionaldataisgiveninthenextsectioninthecontextofourdiscussionofDBSCAN.

ClassificationofPointsAccordingtoCenter-BasedDensityThecenter-basedapproachtodensityallowsustoclassifyapointasbeing(1)intheinteriorofadenseregion(acorepoint),(2)ontheedgeofadenseregion(aborderpoint),or(3)inasparselyoccupiedregion(anoiseorbackgroundpoint).Figure7.21 graphicallyillustratestheconceptsofcore,border,andnoisepointsusingacollectionoftwo-dimensionalpoints.Thefollowingtextprovidesamoreprecisedescription.

Figure7.21.Core,border,andnoisepoints.

Core points: These points are in the interior of a density-based cluster. A point is a core point if there are at least MinPts points within a distance of Eps of it, where MinPts and Eps are user-specified parameters. In Figure 7.21, point A is a core point for the indicated radius (Eps) if MinPts ≤ 7.

Borderpoints:Aborderpointisnotacorepoint,butfallswithintheneighborhoodofacorepoint.InFigure7.21 ,pointBisaborderpoint.Aborderpointcanfallwithintheneighborhoodsofseveralcorepoints.

Noisepoints:Anoisepointisanypointthatisneitheracorepointnoraborderpoint.InFigure7.21 ,pointCisanoisepoint.

7.4.2TheDBSCANAlgorithm

Given the previous definitions of core points, border points, and noise points, the DBSCAN algorithm can be informally described as follows. Any two core points that are close enough, within a distance Eps of one another, are put in the same cluster. Likewise, any border point that is close enough to a core point is put in the same cluster as the core point. (Ties need to be resolved if a border point is close to core points from different clusters.) Noise points are discarded. The formal details are given in Algorithm 7.5. This algorithm uses the same concepts and finds the same clusters as the original DBSCAN, but is optimized for simplicity, not efficiency.

Algorithm 7.5 DBSCAN algorithm.

1: Label all points as core, border, or noise points.
2: Eliminate noise points.
3: Put an edge between all core points within a distance Eps of each other.
4: Make each group of connected core points into a separate cluster.
5: Assign each border point to one of the clusters of its associated core points.

Time and Space Complexity The basic time complexity of the DBSCAN algorithm is O(m × time to find points in the Eps-neighborhood), where m is the number of points. In the worst case, this complexity is O(m^2). However, in low-dimensional spaces (especially 2D space), data structures such as kd-trees allow efficient retrieval of all points within a given distance of a specified point, and the time complexity can be as low as O(m log m) in the average case. The space requirement of DBSCAN, even for high-dimensional data, is O(m) because it is necessary to keep only a small amount of data for each point, i.e., the cluster label and the identification of each point as a core, border, or noise point.
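Algorithm 7.5 can be rendered almost line for line in Python. The sketch below is ours (names such as dbscan and min_pts are illustrative) and, like the algorithm above, favors simplicity over efficiency: it computes all pairwise distances, so it runs in O(m^2).

import math

def dbscan(points, eps, min_pts):
    """Simplified DBSCAN following Algorithm 7.5."""
    n = len(points)
    neighbors = [[j for j in range(n)
                  if math.dist(points[i], points[j]) <= eps] for i in range(n)]
    core = set(i for i in range(n) if len(neighbors[i]) >= min_pts)    # step 1
    labels = {}                                                        # point index -> cluster id
    cid = 0
    # steps 3-4: connect core points within Eps of each other (simple traversal)
    for c in core:
        if c in labels:
            continue
        stack, labels[c] = [c], cid
        while stack:
            cur = stack.pop()
            for nb in neighbors[cur]:
                if nb in core and nb not in labels:
                    labels[nb] = cid
                    stack.append(nb)
        cid += 1
    # step 5: assign each border point to a cluster of one of its core neighbors
    for i in range(n):
        if i not in labels:
            for nb in neighbors[i]:
                if nb in core:
                    labels[i] = labels[nb]
                    break
    return labels   # points absent from `labels` are noise (step 2)

pts = [(x / 10, 0.0) for x in range(10)] + [(5.0, 5.0)]
print(dbscan(pts, eps=0.15, min_pts=3))   # the line of points forms one cluster; (5, 5) is noise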

Selection of DBSCAN Parameters There is, of course, the issue of how to determine the parameters Eps and MinPts. The basic approach is to look at the behavior of the distance from a point to its kth nearest neighbor, which we will call the k-dist. For points that belong to some cluster, the value of k-dist will be small if k is not larger than the cluster size. Note that there will be some variation, depending on the density of the cluster and the random distribution of points, but on average, the range of variation will not be huge if the cluster densities are not radically different. However, for points that are not in a cluster, such as noise points, the k-dist will be relatively large. Therefore, if we compute the k-dist for all the data points for some k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of k-dist that corresponds to a suitable value of Eps. If we select this distance as the Eps parameter and take the value of k as the MinPts parameter, then points for which k-dist is less than Eps will be labeled as core points, while other points will be labeled as noise or border points.

Figure 7.22 shows a sample data set, while the k-dist graph for the data is given in Figure 7.23. The value of Eps that is determined in this way depends on k, but does not change dramatically as k changes. If the value of k is too small, then even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters. If the value of k is too large, then small clusters (of size less than k) are likely to be labeled as noise. The original DBSCAN algorithm used a value of k = 4, which appears to be a reasonable value for most two-dimensional data sets.
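The k-dist heuristic is easy to compute. The sketch below (ours; the data are synthetic) collects, for k = 4, the distance of each point to its fourth nearest neighbor and sorts the values, which is exactly the quantity plotted in Figure 7.23; in practice one would plot the sorted values and look for the knee.

import math
import random

def k_dist(points, k=4):
    """Distance of each point to its kth nearest neighbor, sorted in increasing order."""
    out = []
    for p in points:
        d = sorted(math.dist(p, q) for q in points if q is not p)
        out.append(d[k - 1])
    return sorted(out)

random.seed(0)
cluster = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(100)]
noise = [(random.uniform(-3, 3), random.uniform(-3, 3)) for _ in range(20)]
kd = k_dist(cluster + noise, k=4)
print(kd[:3], kd[-3:])   # small values for cluster points, much larger ones for noise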

Figure7.22.Sampledata.

Figure7.23.K-distplotforsampledata.

ClustersofVaryingDensityDBSCANcanhavetroublewithdensityifthedensityofclustersvarieswidely.ConsiderFigure7.24 ,whichshowsfourclustersembeddedinnoise.Thedensityoftheclustersandnoiseregionsisindicatedbytheirdarkness.Thenoisearoundthepairofdenserclusters,AandB,hasthesamedensityasclustersCandD.ForafixedMinPts,iftheEpsthresholdischosensothatDBSCANfindsCandDasdistinctclusters,withthepointssurroundingthem

asnoise,thenAandBandthepointssurroundingthemwillbecomeasinglecluster.IftheEpsthresholdissetsothatDBSCANfindsAandBasseparateclusters,andthepointssurroundingthemaremarkedasnoise,thenC,D,andthepointssurroundingthemwillalsobemarkedasnoise.

Figure7.24.Fourclustersembeddedinnoise.

AnExampleToillustratetheuseofDBSCAN,weshowtheclustersthatitfindsintherelativelycomplicatedtwo-dimensionaldatasetshowninFigure7.22 .Thisdatasetconsistsof3000two-dimensionalpoints.TheEpsthresholdforthisdatawasfoundbyplottingthesorteddistancesofthefourthnearestneighborofeachpoint(Figure7.23 )andidentifyingthevalueatwhichthereisasharpincrease.Weselected ,whichcorrespondstothekneeofthecurve.TheclustersfoundbyDBSCANusingtheseparameters,i.e.,and ,areshowninFigure7.25(a) .Thecorepoints,borderpoints,andnoisepointsaredisplayedinFigure7.25(b) .

Eps=10MinPts=4

Eps=10

Figure7.25.DBSCANclusteringof3000two-dimensionalpoints.

7.4.3StrengthsandWeaknesses

BecauseDBSCANusesadensity-baseddefinitionofacluster,itisrelativelyresistanttonoiseandcanhandleclustersofarbitraryshapesandsizes.Thus,DBSCANcanfindmanyclustersthatcouldnotbefoundusingK-means,suchasthoseinFigure7.22 .Asindicatedpreviously,however,DBSCANhastroublewhentheclustershavewidelyvaryingdensities.Italsohastroublewithhigh-dimensionaldatabecausedensityismoredifficulttodefineforsuchdata.OnepossibleapproachtodealingwithsuchissuesisgiveninSection8.4.9 .Finally,DBSCANcanbeexpensivewhenthecomputationofnearestneighborsrequirescomputingallpairwiseproximities,asisusuallythecaseforhigh-dimensionaldata.

7.5ClusterEvaluationInsupervisedclassification,theevaluationoftheresultingclassificationmodelisanintegralpartoftheprocessofdevelopingaclassificationmodel,andtherearewell-acceptedevaluationmeasuresandprocedures,e.g.,accuracyandcross-validation,respectively.However,becauseofitsverynature,clusterevaluationisnotawell-developedorcommonlyusedpartofclusteranalysis.Nonetheless,clusterevaluation,orclustervalidationasitismoretraditionallycalled,isimportant,andthissectionwillreviewsomeofthemostcommonandeasilyappliedapproaches.

Theremightbesomeconfusionastowhyclusterevaluationisnecessary.Manytimes,clusteranalysisisconductedasapartofanexploratorydataanalysis.Hence,evaluationseemstobeanunnecessarilycomplicatedadditiontowhatissupposedtobeaninformalprocess.Furthermore,becausethereareanumberofdifferenttypesofclusters—insomesense,eachclusteringalgorithmdefinesitsowntypeofcluster—itcanseemthateachsituationmightrequireadifferentevaluationmeasure.Forinstance,K-meansclustersmightbeevaluatedintermsoftheSSE,butfordensity-basedclusters,whichneednotbeglobular,SSEwouldnotworkwellatall.

Nonetheless,clusterevaluationshouldbeapartofanyclusteranalysis.Akeymotivationisthatalmosteveryclusteringalgorithmwillfindclustersinadataset,evenifthatdatasethasnonaturalclusterstructure.Forinstance,considerFigure7.26 ,whichshowstheresultofclustering100pointsthatarerandomly(uniformly)distributedontheunitsquare.TheoriginalpointsareshowninFigure7.26(a) ,whiletheclustersfoundbyDBSCAN,K-means,andcompletelinkareshowninFigures7.26(b) ,7.26(c) ,and7.26(d) ,respectively.SinceDBSCANfoundthreeclusters(afterwesetEpsbylooking

atthedistancesofthefourthnearestneighbors),wesetK-meansandcompletelinktofindthreeclustersaswell.(InFigure7.26(b) thenoiseisshownbythesmallmarkers.)However,theclustersdonotlookcompellingforanyofthethreemethods.Inhigherdimensions,suchproblemscannotbesoeasilydetected.

Figure7.26.

Clusteringof100uniformlydistributedpoints.

7.5.1Overview

Beingabletodistinguishwhetherthereisnon-randomstructureinthedataisjustoneimportantaspectofclustervalidation.Thefollowingisalistofseveralimportantissuesforclustervalidation.

1. Determiningtheclusteringtendencyofasetofdata,i.e.,distinguishingwhethernon-randomstructureactuallyexistsinthedata.

2. Determiningthecorrectnumberofclusters.3. Evaluatinghowwelltheresultsofaclusteranalysisfitthedatawithout

referencetoexternalinformation.4. Comparingtheresultsofaclusteranalysistoexternallyknownresults,

suchasexternallyprovidedclasslabels.5. Comparingtwosetsofclusterstodeterminewhichisbetter.

Noticethatitems1,2,and3donotmakeuseofanyexternalinformation—theyareunsupervisedtechniques—whileitem4requiresexternalinformation.Item5canbeperformedineitherasupervisedoranunsupervisedmanner.Afurtherdistinctioncanbemadewithrespecttoitems3,4,and5:Dowewanttoevaluatetheentireclusteringorjustindividualclusters?

Whileitispossibletodevelopvariousnumericalmeasurestoassessthedifferentaspectsofclustervaliditymentionedabove,thereareanumberofchallenges.First,ameasureofclustervalidityissometimesquitelimitedinthescopeofitsapplicability.Forexample,mostworkonmeasuresofclusteringtendencyhasbeendonefortwo-orthree-dimensionalspatialdata.Second,weneedaframeworktointerpretanymeasure.Ifweobtainavalue

of10forameasurethatevaluateshowwellclusterlabelsmatchexternallyprovidedclasslabels,doesthisvaluerepresentagood,fair,orpoormatch?Thegoodnessofamatchoftencanbemeasuredbylookingatthestatisticaldistributionofthisvalue,i.e.,howlikelyitisthatsuchavalueoccursbychance.Finally,ifameasureistoocomplicatedtoapplyortounderstand,thenfewwilluseit.

Theevaluationmeasures,orindices,thatareappliedtojudgevariousaspectsofclustervalidityaretraditionallyclassifiedintothefollowingthreetypes.

Unsupervised.Measuresthegoodnessofaclusteringstructurewithoutrespecttoexternalinformation.AnexampleofthisistheSSE.Unsupervisedmeasuresofclustervalidityareoftenfurtherdividedintotwoclasses:measuresofclustercohesion(compactness,tightness),whichdeterminehowcloselyrelatedtheobjectsinaclusterare,andmeasuresofclusterseparation(isolation),whichdeterminehowdistinctorwell-separatedaclusterisfromotherclusters.Unsupervisedmeasuresareoftencalledinternalindicesbecausetheyuseonlyinformationpresentinthedataset.

Supervised.Measurestheextenttowhichtheclusteringstructurediscoveredbyaclusteringalgorithmmatchessomeexternalstructure.Anexampleofasupervisedindexisentropy,whichmeasureshowwellclusterlabelsmatchexternallysuppliedclasslabels.Supervisedmeasuresareoftencalledexternalindicesbecausetheyuseinformationnotpresentinthedataset.

Relative.Comparesdifferentclusteringsorclusters.Arelativeclusterevaluationmeasureisasupervisedorunsupervisedevaluationmeasurethatisusedforthepurposeofcomparison.Thus,relativemeasuresarenotactuallyaseparatetypeofclusterevaluationmeasure,butareinsteadaspecificuseofsuchmeasures.Asanexample,twoK-meansclusteringscanbecomparedusingeithertheSSEorentropy.

Intheremainderofthissection,weprovidespecificdetailsconcerningclustervalidity.Wefirstdescribetopicsrelatedtounsupervisedclusterevaluation,beginningwith(1)measuresbasedoncohesionandseparation,and(2)twotechniquesbasedontheproximitymatrix.Sincetheseapproachesareusefulonlyforpartitionalsetsofclusters,wealsodescribethepopularcopheneticcorrelationcoefficient,whichcanbeusedfortheunsupervisedevaluationofahierarchicalclustering.Weendourdiscussionofunsupervisedevaluationwithbriefdiscussionsaboutfindingthecorrectnumberofclustersandevaluatingclusteringtendency.Wethenconsidersupervisedapproachestoclustervalidity,suchasentropy,purity,andtheJaccardmeasure.Weconcludethissectionwithashortdiscussionofhowtointerpretthevaluesof(unsupervisedorsupervised)validitymeasures.

7.5.2UnsupervisedClusterEvaluationUsingCohesionandSeparation

Manyinternalmeasuresofclustervalidityforpartitionalclusteringschemesarebasedonthenotionsofcohesionorseparation.Inthissection,weuseclustervaliditymeasuresforprototype-andgraph-basedclusteringtechniquestoexplorethesenotionsinsomedetail.Intheprocess,wewillalsoseesomeinterestingrelationshipsbetweenprototype-andgraph-basedmeasures.

Ingeneral,wecanconsiderexpressingoverallclustervalidityforasetofKclustersasaweightedsumofthevalidityofindividualclusters,

overall validity = \sum_{i=1}^{K} w_i \, validity(C_i).    (7.8)

Thevalidityfunctioncanbecohesion,separation,orsomecombinationofthesequantities.Theweightswillvarydependingontheclustervaliditymeasure.Insomecases,theweightsaresimply1orthesizeofthecluster,whileinothercasestobediscussedabitlater,theyreflectamorecomplicatedpropertyofthecluster.

Graph-BasedViewofCohesionandSeparationFromagraph-basedview,thecohesionofaclustercanbedefinedasthesumoftheweightsofthelinksintheproximitygraphthatconnectpointswithinthecluster.SeeFigure7.27(a) .(Recallthattheproximitygraphhasdataobjectsasnodes,alinkbetweeneachpairofdataobjects,andaweightassignedtoeachlinkthatistheproximitybetweenthetwodataobjectsconnectedbythelink.)Likewise,theseparationbetweentwoclusterscanbemeasuredbythesumoftheweightsofthelinksfrompointsinoneclustertopointsintheothercluster.ThisisillustratedinFigure7.27(b) .

Figure7.27.Graph-basedviewofclustercohesionandseparation.

Mostsimply,thecohesionandseparationforagraph-basedclustercanbeexpressedusingEquations7.9 and7.10 ,respectively.Theproximityfunctioncanbeasimilarityoradissimilarity.Forsimilarity,asinTable7.6 ,highervaluesarebetterforcohesionwhilelowervaluesarebetterforseparation.Fordissimilarity,theoppositeistrue,i.e.,lowervaluesarebetterforcohesionwhilehighervaluesarebetterforseparation.Morecomplicatedapproachesarepossiblebuttypicallyembodythebasicideasoffigure7.27a and7.27b .

Prototype-BasedViewofCohesionandSeparationForprototype-basedclusters,thecohesionofaclustercanbedefinedasthesumoftheproximitieswithrespecttotheprototype(centroidormedoid)ofthecluster.Similarly,theseparationbetweentwoclusterscanbemeasuredbytheproximityofthetwoclusterprototypes.ThisisillustratedinFigure7.28 ,wherethecentroidofaclusterisindicatedbya“+”.

cohesion(C_i) = \sum_{x \in C_i, y \in C_i} proximity(x, y)    (7.9)

separation(C_i, C_j) = \sum_{x \in C_i, y \in C_j} proximity(x, y)    (7.10)
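Equations 7.9 and 7.10 translate directly into code. The sketch below (ours; the similarity matrix is a toy example) computes graph-based cohesion and separation from a matrix of pairwise similarities, summing over ordered pairs of distinct objects.

def cohesion(cluster, sim):
    """Equation 7.9: sum of pairwise proximities within a cluster (self-pairs omitted)."""
    return sum(sim[x][y] for x in cluster for y in cluster if x != y)

def separation(ci, cj, sim):
    """Equation 7.10: sum of pairwise proximities across two clusters."""
    return sum(sim[x][y] for x in ci for y in cj)

# Toy similarity matrix for five objects (0-4); higher values mean more alike.
sim = [[1.0, 0.9, 0.8, 0.1, 0.2],
       [0.9, 1.0, 0.7, 0.2, 0.1],
       [0.8, 0.7, 1.0, 0.1, 0.1],
       [0.1, 0.2, 0.1, 1.0, 0.9],
       [0.2, 0.1, 0.1, 0.9, 1.0]]

c1, c2 = {0, 1, 2}, {3, 4}
print(cohesion(c1, sim), cohesion(c2, sim))    # high within-cluster similarity
print(separation(c1, c2, sim))                 # low similarity across the two clusters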

Figure7.28.Prototype-basedviewofclustercohesionandseparation.

Cohesion for a prototype-based cluster is given in Equation 7.11, while two measures for separation are given in Equations 7.12 and 7.13, respectively, where c_i is the prototype (centroid) of cluster C_i and c is the overall prototype (centroid). There are two measures for separation because, as we will see in the next section, the separation of cluster prototypes from an overall prototype is sometimes directly related to the separation of cluster prototypes from one another. (This is true, for example, for Euclidean distance.) Note that Equation 7.11 is the cluster SSE if we let proximity be the squared Euclidean distance.

RelationshipbetweenPrototype-BasedCohesionandGraph-BasedCohesionWhilethegraph-basedandprototype-basedapproachestomeasuringthecohesionandseparationofaclusterseemdistinct,forsomeproximitymeasurestheyareequivalent.Forinstance,fortheSSEandpointsinEuclideanspace,itcanbeshown(Equation7.14 )thattheaveragepairwisedistancebetweenthepointsinaclusterisequivalenttotheSSEofthecluster.SeeExercise27 onpage610.

cohesion(C_i) = \sum_{x \in C_i} proximity(x, c_i)    (7.11)

separation(C_i, C_j) = proximity(c_i, c_j)    (7.12)

separation(C_i) = proximity(c_i, c)    (7.13)

Cluster SSE = \sum_{x \in C_i} dist(c_i, x)^2 = \frac{1}{2 m_i} \sum_{x \in C_i} \sum_{y \in C_i} dist(x, y)^2    (7.14)

Relationship of the Two Approaches to Prototype-Based Separation When proximity is measured by Euclidean distance, the traditional measure of separation between clusters is the between group sum of squares (SSB), which is the sum of the squared distance of a cluster centroid, c_i, to the overall mean, c, of all the data points. The SSB is given by Equation 7.15, where c_i is the mean of the ith cluster and c is the overall mean. The higher the total SSB of a clustering, the more separated the clusters are from one another.

Total SSB = \sum_{i=1}^{K} m_i \, dist(c_i, c)^2    (7.15)

It is straightforward to show that the total SSB is directly related to the pairwise distances between the centroids. In particular, if the cluster sizes are equal, i.e., m_i = m/K, then this relationship takes the simple form given by Equation 7.16. (See Exercise 28 on page 610.) It is this type of equivalence that motivates the definition of prototype separation in terms of both Equations 7.12 and 7.13.

Total SSB = \frac{1}{2K} \sum_{i=1}^{K} \sum_{j=1}^{K} \frac{m}{K} \, dist(c_i, c_j)^2    (7.16)

Relationship between Cohesion and Separation For some validity measures, there is also a strong relationship between cohesion and separation. Specifically, it is possible to show that the sum of the total SSE and the total SSB is a constant; i.e., that it is equal to the total sum of squares (TSS), which is the sum of squares of the distance of each point to the overall mean of the data. The importance of this result is that minimizing SSE (cohesion) is equivalent to maximizing SSB (separation).

We provide the proof of this fact below, since the approach illustrates techniques that are also applicable to proving the relationships stated in the last two sections. To simplify the notation, we assume that the data is one-dimensional, i.e., dist(x, y) = (x - y)^2. Also, we use the fact that the cross-term \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)(c - c_i) is 0. (See Exercise 29 on page 610.)

TSS = \sum_{i=1}^{K} \sum_{x \in C_i} (x - c)^2
    = \sum_{i=1}^{K} \sum_{x \in C_i} ((x - c_i) - (c - c_i))^2
    = \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)^2 - 2 \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)(c - c_i) + \sum_{i=1}^{K} \sum_{x \in C_i} (c - c_i)^2
    = \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)^2 + \sum_{i=1}^{K} \sum_{x \in C_i} (c - c_i)^2
    = \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)^2 + \sum_{i=1}^{K} |C_i| (c - c_i)^2
    = SSE + SSB

Relationship between Graph- and Centroid-Based Cohesion It can also be shown that there is a relationship between graph- and centroid-based cohesion measures for Euclidean distance. For simplicity, we once again assume one-dimensional data. Recall that c_i = \frac{1}{m_i} \sum_{y \in C_i} y.

m_i^2 \, cohesion(C_i) = m_i^2 \sum_{x \in C_i} proximity(x, c_i) = \sum_{x \in C_i} m_i^2 (x - c_i)^2
    = \sum_{x \in C_i} (m_i x - m_i c_i)^2 = \sum_{x \in C_i} \Big(m_i x - m_i \big(\tfrac{1}{m_i} \sum_{y \in C_i} y\big)\Big)^2
    = \sum_{x \in C_i} \sum_{y \in C_i} (x - y)^2 = \sum_{x \in C_i, y \in C_i} (x - y)^2 = \sum_{x \in C_i, y \in C_i} proximity(x, y)

More generally, in cases where a centroid makes sense for the data, the simple graph- or centroid-based measures of cluster validity we presented are often related.
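The identity TSS = SSE + SSB is easy to verify numerically. The sketch below (ours) does so for an arbitrary one-dimensional clustering.

clusters = [[1.0, 2.0, 3.0], [8.0, 9.0, 10.0, 13.0]]   # a 1-D clustering
points = [x for c in clusters for x in c]

c_all = sum(points) / len(points)                       # overall mean
tss = sum((x - c_all) ** 2 for x in points)

sse = ssb = 0.0
for c in clusters:
    ci = sum(c) / len(c)                                # cluster mean
    sse += sum((x - ci) ** 2 for x in c)
    ssb += len(c) * (ci - c_all) ** 2

print(round(tss, 6), round(sse + ssb, 6))               # the two values agree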

OverallMeasuresofCohesionandSeparationThepreviousdefinitionsofclustercohesionandseparationgaveussomesimpleandwell-definedmeasuresofindividualclustervaliditythatcanbecombinedintoanoverallmeasureofclustervaliditybyusingaweightedsum,asindicatedinEquation7.8 .However,weneedtodecidewhatweightstouse.Notsurprisingly,theweightsusedcanvarywidely.Often,butnotalways,theyareeitherafunctionofclustersizeor1,whichtreatsallclustersequallyregardlessofsize.

The CLUstering TOolkit (CLUTO) (see the Bibliographic Notes) uses the cluster evaluation measures described in Table 7.6, as well as some other evaluation measures not mentioned here. Only similarity measures are used: cosine, correlation, Jaccard, and the inverse of Euclidean distance. ℐ1 is a measure of cohesion in terms of the pairwise similarity of objects in the cluster. ℐ2 is a measure of cohesion that can be expressed either in terms of the sum of the similarities of objects in the cluster to the cluster centroid or in terms of the pairwise similarities of objects in the cluster. ε1 is a measure of separation. It can be defined in terms of the similarity of a cluster centroid to the overall centroid or in terms of the pairwise similarities of objects in the cluster to objects in other clusters. (Although ε1 is a measure of separation, the second definition shows that it also uses cluster cohesion, albeit in the cluster weight.) 𝒢1, which is a cluster validity measure based on both cohesion and separation, is the sum of the pairwise similarities of all objects in the cluster with all objects outside the cluster (the total weight of the edges of the similarity graph that must be cut to separate the cluster from all other clusters) divided by the sum of the pairwise similarities of objects in the cluster.

Table 7.6. Table of graph-based cluster evaluation measures.

Name | Cluster Measure | Cluster Weight | Type
ℐ1 | \sum_{x \in C_i, y \in C_i} sim(x, y) | 1/m_i | graph-based cohesion
ℐ2 | \sum_{x \in C_i} sim(x, c_i) |  | prototype-based cohesion
ℐ2 | \sum_{x \in C_i, y \in C_i} sim(x, y) |  | prototype-based cohesion
ε1 | sim(c_i, c) | m_i | prototype-based separation
ε1 | \sum_{j=1}^{k} \sum_{x \in C_i, y \in C_j} sim(x, y) | m_i / \sum_{x \in C_i, y \in C_i} sim(x, y) | graph-based separation
𝒢1 | \sum_{j=1, j \neq i}^{k} \sum_{x \in C_i, y \in C_j} sim(x, y) | 1 / \sum_{x \in C_i, y \in C_i} sim(x, y) | graph-based separation and cohesion

Note that any unsupervised measure of cluster validity potentially can be used as an objective function for a clustering algorithm and vice versa. CLUTO takes this approach by using an algorithm that is similar to the incremental K-means algorithm discussed in Section 7.2.2. Specifically, each point is assigned to the cluster that produces the best value for the cluster evaluation function. The cluster evaluation measure ℐ2 corresponds to traditional K-means and produces clusters that have good SSE values. The other measures produce clusters that are not as good with respect to SSE, but that are more optimal with respect to the specified cluster validity measure.

EvaluatingIndividualClustersandObjectsSofar,wehavefocusedonusingcohesionandseparationintheoverallevaluationofagroupofclusters.Mostofthesemeasuresofclustervalidityalsocanbeusedtoevaluateindividualclustersandobjects.Forexample,wecanrankindividualclustersaccordingtotheirspecificvalueofclustervalidity,i.e.,clustercohesionorseparation.Aclusterthathasahighvalueofcohesion


maybeconsideredbetterthanaclusterthathasalowervalue.Thisinformationoftencanbeusedtoimprovethequalityofaclustering.If,forexample,aclusterisnotverycohesive,thenwemaywanttosplititintoseveralsubclusters.Ontheotherhand,iftwoclustersarerelativelycohesive,butnotwellseparated,wemaywanttomergethemintoasinglecluster.

Wecanalsoevaluatetheobjectswithinaclusterintermsoftheircontributiontotheoverallcohesionorseparationofthecluster.Objectsthatcontributemoretothecohesionandseparationarenearthe“interior”ofthecluster.Thoseobjectsforwhichtheoppositeistrueareprobablynearthe“edge”ofthecluster.Inthefollowingsection,weconsideraclusterevaluationmeasurethatusesanapproachbasedontheseideastoevaluatepoints,clusters,andtheentiresetofclusters.

TheSilhouetteCoefficientThepopularmethodofsilhouettecoefficientscombinesbothcohesionandseparation.Thefollowingstepsexplainhowtocomputethesilhouettecoefficientforanindividualpoint,aprocessthatconsistsofthefollowingthreesteps.Weusedistances,butananalogousapproachcanbeusedforsimilarities.

1. For the ith object, calculate its average distance to all other objects in its cluster. Call this value a_i.

2. For the ith object and any cluster not containing the object, calculate the object's average distance to all the objects in the given cluster. Find the minimum such value with respect to all clusters; call this value b_i.

3. For the ith object, the silhouette coefficient is s_i = (b_i - a_i)/max(a_i, b_i).

The value of the silhouette coefficient can vary between -1 and 1. A negative value is undesirable because this corresponds to a case in which a_i, the average distance to points in the cluster, is greater than b_i, the minimum average distance to points in another cluster. We want the silhouette coefficient to be positive (a_i < b_i), and for a_i to be as close to 0 as possible, since the coefficient assumes its maximum value of 1 when a_i = 0.

Wecancomputetheaveragesilhouettecoefficientofaclusterbysimplytakingtheaverageofthesilhouettecoefficientsofpointsbelongingtothecluster.Anoverallmeasureofthegoodnessofaclusteringcanbeobtainedbycomputingtheaveragesilhouettecoefficientofallpoints.
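The three steps above translate directly into a short sketch (ours; the function name silhouette and the toy points are illustrative), using distances as in the text.

import math

def silhouette(point_idx, clusters, points):
    """Silhouette coefficient s_i = (b_i - a_i) / max(a_i, b_i) for one point."""
    own = next(c for c in clusters if point_idx in c)
    p = points[point_idx]
    def avg(idxs):
        return sum(math.dist(p, points[j]) for j in idxs) / len(idxs)
    a_i = avg([j for j in own if j != point_idx])                 # step 1
    b_i = min(avg(list(c)) for c in clusters if c is not own)     # step 2
    return (b_i - a_i) / max(a_i, b_i)                            # step 3

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = [{0, 1, 2}, {3, 4, 5}]
scores = [silhouette(i, clusters, points) for i in range(len(points))]
print([round(s, 2) for s in scores])
print(round(sum(scores) / len(scores), 2))   # average silhouette of the whole clustering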

Example7.8(SilhouetteCoefficient).Figure7.29 showsaplotofthesilhouettecoefficientsforpointsin10clusters.Darkershadesindicatelowersilhouettecoefficients.

Figure7.29.Silhouettecoefficientsforpointsintenclusters.

7.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix

Inthissection,weexamineacoupleofunsupervisedapproachesforassessingclustervaliditythatarebasedontheproximitymatrix.Thefirstcomparesanactualandidealizedproximitymatrix,whilethesecondusesvisualization.

GeneralCommentsonUnsupervisedClusterEvaluationMeasuresInadditiontothemeasurespresentedabove,alargenumberofothermeasureshavebeenproposedasunsupervisedclustervaliditymeasures.Almostallthesemeasures,includingthemeasurespresentedaboveareintendedforpartitional,center-basedclusters.Inparticular,noneofthemdoeswellforcontinuity-ordensity-basedclusters.Indeed,arecentevaluation—seeBibliographicNotes—ofadozensuchmeasuresfoundthatalthoughanumberofthemdidwellintermsofhandlingissuessuchasnoiseanddifferingsizesanddensity,noneofthemexceptarelativelyrecentlyproposedmeasure,ClusteringValidationindexbasedonNearestNeighbors(CVNN),didwellonhandlingarbitraryshapes.Thesilhouetteindex,however,didwellonallotherissuesexaminedexceptforthat.

Most unsupervised cluster evaluation measures, such as the silhouette coefficient, incorporate both cohesion (compactness) and separation. When used with a partitional clustering algorithm such as K-means, these measures will tend to decrease until the "natural" set of clusters is found and start increasing once clusters are being split "too finely," since separation will suffer and cohesion will not improve much. Thus, these measures can provide a way to determine the number of clusters. However, if the definition of a cluster used by the clustering algorithm differs from that of the cluster evaluation measure, then the set of clusters identified as optimal by the algorithm and the validation measure can differ. For instance, CLUTO uses the measures described in Table 7.6 to drive its clustering, and thus, the clustering produced will not usually match the optimal clusters according to the silhouette coefficient. The same is true for standard K-means and the SSE. In addition, if there actually are subclusters that are not separated very well from one another, then measures that incorporate both cohesion and separation may provide only a coarse view of the cluster structure of the data. Another consideration is that when clustering for summarization, we are not interested in the "natural" cluster structure of the data, but rather want to achieve a certain level of approximation, e.g., we want to reduce the SSE to a certain level.

More generally, if there are not too many clusters, then it can be better to examine cluster cohesion and separation independently. This can give a more comprehensive view of how cohesive each cluster is and how well each pair of clusters is separated from one another. For instance, given a centroid-based clustering, we could compute the pairwise similarity or distance of the centroids, i.e., compute the distance or similarity matrix of the centroids. The approach just outlined is similar to looking at the confusion matrix for a classification problem instead of classification measures, such as accuracy or the F-measure.

Measuring Cluster Validity via Correlation

If we are given the similarity matrix for a data set and the cluster labels from a cluster analysis of the data set, then we can evaluate the "goodness" of the clustering by looking at the correlation between the similarity matrix and an ideal version of the similarity matrix based on the cluster labels. (With minor changes, the following applies to proximity matrices, but for simplicity, we discuss only similarity matrices.) More specifically, an ideal cluster is one whose points have a similarity of 1 to all points in the cluster, and a similarity of 0 to all points in other clusters. Thus, if we sort the rows and columns of the similarity matrix so that all objects belonging to the same cluster are together, then an ideal cluster similarity matrix has a block diagonal structure. In other words, the similarity is non-zero, i.e., 1, inside the blocks of the similarity matrix whose entries represent intra-cluster similarity, and 0 elsewhere. The ideal cluster similarity matrix is constructed by creating a matrix that has one row and one column for each data point (just like an actual similarity matrix) and assigning a 1 to an entry if the associated pair of points belongs to the same cluster. All other entries are 0.

High correlation between the ideal and actual similarity matrices indicates that the points that belong to the same cluster are close to each other, while low correlation indicates the opposite. (Because the actual and ideal similarity matrices are symmetric, the correlation is calculated only among the n(n−1)/2 entries below or above the diagonal of the matrices.) Consequently, this is not a good measure for many density- or contiguity-based clusters, because they are not globular and can be closely intertwined with other clusters.
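As a rough illustration of this measure, the following sketch (our own helper, assuming NumPy) builds the ideal cluster similarity matrix from a vector of cluster labels and correlates it with an actual similarity matrix, using only the entries above the diagonal as described above.

    import numpy as np

    def ideal_vs_actual_correlation(sim, labels):
        sim = np.asarray(sim, dtype=float)
        labels = np.asarray(labels)
        ideal = (labels[:, None] == labels[None, :]).astype(float)   # 1 if same cluster, else 0
        iu = np.triu_indices_from(sim, k=1)                          # entries above the diagonal
        return np.corrcoef(sim[iu], ideal[iu])[0, 1]

Values near 1 indicate the block-diagonal structure of well-separated clusters, while values near 0 indicate little relationship between the clustering and the actual similarities.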

Example 7.9 (Correlation of Actual and Ideal Similarity Matrices). To illustrate this measure, we calculated the correlation between the ideal and actual similarity matrices for the K-means clusters shown in Figure 7.26(c) (random data) and Figure 7.30(a) (data with three well-separated clusters). The correlations were 0.5810 and 0.9235, respectively, which reflects the expected result that the clusters found by K-means in the random data are worse than the clusters found by K-means in data with well-separated clusters.


Figure7.30.Similaritymatrixforwell-separatedclusters.

Judging a Clustering Visually by Its Similarity Matrix

The previous technique suggests a more general, qualitative approach to judging a set of clusters: Order the similarity matrix with respect to cluster labels and then plot it. In theory, if we have well-separated clusters, then the similarity matrix should be roughly block-diagonal. If not, then the patterns displayed in the similarity matrix can reveal the relationships between clusters. Again, all of this can be applied to dissimilarity matrices, but for simplicity, we will only discuss similarity matrices.

Example 7.10 (Visualizing a Similarity Matrix). Consider the points in Figure 7.30(a), which form three well-separated clusters. If we use K-means to group these points into three clusters, then we should have no trouble finding these clusters because they are well-separated. The separation of these clusters is illustrated by the reordered similarity matrix shown in Figure 7.30(b). (For uniformity, we have transformed the distances into similarities using the formula s = 1 − (d − min_d)/(max_d − min_d).) Figure 7.31 shows the reordered similarity matrices for clusters found in the random data set of Figure 7.26 by DBSCAN, K-means, and complete link.

Figure7.31.Similaritymatricesforclustersfromrandomdata.

The well-separated clusters in Figure 7.30 show a very strong, block-diagonal pattern in the reordered similarity matrix. However, there are also weak block-diagonal patterns (see Figure 7.31) in the reordered similarity matrices of the clusterings found by K-means, DBSCAN, and complete link in the random data. Just as people can find patterns in clouds, data mining algorithms can find clusters in random data. While it is entertaining to find patterns in clouds, it is pointless and perhaps embarrassing to find clusters in noise.
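A minimal sketch of the visualization just described, assuming NumPy and matplotlib; the function name is ours. It reorders a similarity matrix by cluster label and displays it, so that well-separated clusters show up as bright diagonal blocks.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_reordered_similarity(sim, labels):
        order = np.argsort(labels)                       # group points of the same cluster together
        reordered = np.asarray(sim)[np.ix_(order, order)]
        plt.imshow(reordered, cmap="gray", interpolation="nearest")
        plt.colorbar(label="similarity")
        plt.show()

If only distances are available, they can first be converted to similarities with the min-max formula used in Example 7.10.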

This approach may seem hopelessly expensive for large data sets, since the computation of the proximity matrix takes O(m^2) time, where m is the number of objects, but with sampling, this method can still be used. We can take a sample of data points from each cluster, compute the similarity between these points, and plot the result. It is sometimes necessary to oversample small clusters and undersample large ones to obtain an adequate representation of all clusters.

7.5.4 Unsupervised Evaluation of Hierarchical Clustering

The previous approaches to cluster evaluation are intended for partitional clusterings. Here we discuss the cophenetic correlation, a popular evaluation measure for hierarchical clusterings. The cophenetic distance between two objects is the proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time. For example, if at some point in the agglomerative hierarchical clustering process, the smallest distance between the two clusters that are merged is 0.1, then all points in one cluster have a cophenetic distance of 0.1 with respect to the points in the other cluster. In a cophenetic distance matrix, the entries are the cophenetic distances between each pair of objects. The cophenetic distance is different for each hierarchical clustering of a set of points.

Example 7.11 (Cophenetic Distance Matrix). Table 7.7 shows the cophenetic distance matrix for the single link clustering shown in Figure 7.16. (The data for this figure consists of the six two-dimensional points given in Table 2.14.)

Table 7.7. Cophenetic distance matrix for single link and data in Table 2.14 on page 90.

Point P1 P2 P3 P4 P5 P6

P1 0 0.222 0.222 0.222 0.222 0.222

P2 0.222 0 0.148 0.151 0.139 0.148

P3 0.222 0.148 0 0.151 0.148 0.110

P4 0.222 0.151 0.151 0 0.151 0.151

P5 0.222 0.139 0.148 0.151 0 0.148

P6 0.222 0.148 0.110 0.151 0.148 0

The Cophenetic Correlation Coefficient (CPCC) is the correlation between the entries of this matrix and the original dissimilarity matrix and is a standard measure of how well a hierarchical clustering (of a particular type) fits the data. One of the most common uses of this measure is to evaluate which type of hierarchical clustering is best for a particular type of data.
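The cophenetic distances and the CPCC can be obtained from standard hierarchical clustering libraries. The following sketch assumes SciPy's hierarchical clustering routines and a small placeholder data set; it simply compares how well several agglomerative schemes fit the same data, in the spirit of Example 7.12.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, cophenet
    from scipy.spatial.distance import pdist

    X = np.random.rand(30, 2)            # placeholder data; any small point set will do
    d = pdist(X)                         # original pairwise dissimilarities
    for method in ["single", "complete", "average", "ward"]:
        Z = linkage(d, method=method)
        cpcc, coph_d = cophenet(Z, d)    # correlation between d and the cophenetic distances
        print(method, round(cpcc, 2))

The scheme with the highest CPCC is the one whose dendrogram best preserves the original pairwise dissimilarities.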

Example 7.12 (Cophenetic Correlation Coefficient). We calculated the CPCC for the hierarchical clusterings shown in Figures 7.16–7.19. These values are shown in Table 7.8. The hierarchical clustering produced by the single link technique seems to fit the data less well than the clusterings produced by complete link, group average, and Ward's method.

Table7.8.CopheneticcorrelationcoefficientfordataofTable2.14andfouragglomerativehierarchicalclusteringtechniques.

Technique CPCC

SingleLink 0.44

CompleteLink 0.63

GroupAverage 0.66

Ward’s 0.64

7.5.5 Determining the Correct Number of Clusters

Various unsupervised cluster evaluation measures can be used to approximately determine the correct or natural number of clusters.

Example 7.13 (Number of Clusters). The data set of Figure 7.29 has 10 natural clusters. Figure 7.32 shows a plot of the SSE versus the number of clusters for a (bisecting) K-means clustering of the data set, while Figure 7.33 shows the average silhouette coefficient versus the number of clusters for the same data. There is a distinct knee in the SSE and a distinct peak in the silhouette coefficient when the number of clusters is equal to 10.

Figure7.32.SSEversusnumberofclustersforthedataofFigure7.29 onpage582.

Figure7.33.AveragesilhouettecoefficientversusnumberofclustersforthedataofFigure7.29 .

Thus, we can try to find the natural number of clusters in a data set by looking for the number of clusters at which there is a knee, peak, or dip in the plot of the evaluation measure when it is plotted against the number of clusters. Of course, such an approach does not always work well. Clusters can be considerably more intertwined or overlapping than those shown in Figure 7.29. Also, the data can consist of nested clusters. Actually, the clusters in Figure 7.29 are somewhat nested; i.e., there are five pairs of clusters since the clusters are closer top to bottom than they are left to right. There is a knee that indicates this in the SSE curve, but the silhouette coefficient curve is not as clear. In summary, while caution is needed, the technique we have just described can provide insight into the number of clusters in the data.
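A minimal sketch of the procedure just described, assuming scikit-learn, NumPy, and matplotlib; the function name and parameter choices are ours. It plots SSE and the average silhouette coefficient against the number of clusters so that knees and peaks can be inspected visually.

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def sse_and_silhouette_curves(X, k_values):
        # k_values should start at 2, since the silhouette coefficient needs at least two clusters
        ks = list(k_values)
        sse, sil = [], []
        for k in ks:
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
            sse.append(km.inertia_)                       # SSE of this clustering
            sil.append(silhouette_score(X, km.labels_))   # average silhouette coefficient
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
        ax1.plot(ks, sse, marker="o"); ax1.set_xlabel("K"); ax1.set_ylabel("SSE")
        ax2.plot(ks, sil, marker="o"); ax2.set_xlabel("K"); ax2.set_ylabel("average silhouette")
        plt.show()

Looking for a knee in the first curve and a peak in the second mirrors the reading of Figures 7.32 and 7.33.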

7.5.6 Clustering Tendency

One obvious way to determine if a data set has clusters is to try to cluster it. However, almost all clustering algorithms will dutifully find clusters when given data. To address this issue, we could evaluate the resulting clusters and only claim that a data set has clusters if at least some of the clusters are of good quality. However, this approach does not address the fact that the clusters in the data can be of a different type than those sought by our clustering algorithm. To handle this additional problem, we could use multiple algorithms and again evaluate the quality of the resulting clusters. If the clusters are uniformly poor, then this may indeed indicate that there are no clusters in the data.

Alternatively, and this is the focus of measures of clustering tendency, we can try to evaluate whether a data set has clusters without clustering. The most common approach, especially for data in Euclidean space, has been to use statistical tests for spatial randomness. Unfortunately, choosing the correct model, estimating the parameters, and evaluating the statistical significance of the hypothesis that the data is non-random can be quite challenging. Nonetheless, many approaches have been developed, most of them for points in low-dimensional Euclidean space.

Example 7.14 (Hopkins Statistic). For this approach, we generate p points that are randomly distributed across the data space and also sample p actual data points. For both sets of points, we find the distance to the nearest neighbor in the original data set. Let the u_i be the nearest neighbor distances of the artificially generated points, while the w_i are the nearest neighbor distances of the sample of points from the original data set. The Hopkins statistic H is then defined by Equation 7.17.

H = Σ_{i=1}^{p} w_i / (Σ_{i=1}^{p} u_i + Σ_{i=1}^{p} w_i)   (7.17)

If the randomly generated points and the sample of data points have roughly the same nearest neighbor distances, then H will be near 0.5. Values of H near 0 and 1 indicate, respectively, data that is highly clustered and data that is regularly distributed in the data space. To give an example, the Hopkins statistic for the data of Figure 7.26 was computed for p = 20 and 100 different trials. The average value of H was 0.56 with a standard deviation of 0.03. The same experiment was performed for the well-separated points of Figure 7.30. The average value of H was 0.95 with a standard deviation of 0.006.
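The following is a minimal NumPy sketch of the Hopkins statistic as reconstructed in Equation 7.17 (with the w_i in the numerator); the function name, the use of the bounding box of the data, and the Euclidean distance are our own assumptions. Note that some references define H with the u_i in the numerator instead, which flips the interpretation of values near 0 and 1.

    import numpy as np

    def hopkins_statistic(X, p=20, seed=0):
        """Hopkins statistic H of Equation 7.17 for a data matrix X (one row per point)."""
        rng = np.random.default_rng(seed)
        m, d = X.shape
        lo, hi = X.min(axis=0), X.max(axis=0)
        U = rng.uniform(lo, hi, size=(p, d))          # p artificial, uniformly distributed points
        idx = rng.choice(m, size=p, replace=False)    # p points sampled from the data itself
        def nn_dist(points, exclude_self):
            dists = np.linalg.norm(points[:, None, :] - X[None, :, :], axis=2)
            if exclude_self:
                dists[np.arange(p), idx] = np.inf     # ignore the zero distance to the point itself
            return dists.min(axis=1)
        u = nn_dist(U, exclude_self=False)            # u_i: artificial points to their nearest data point
        w = nn_dist(X[idx], exclude_self=True)        # w_i: sampled data points to their nearest neighbor
        return w.sum() / (u.sum() + w.sum())

Averaging the result over many random trials, as in Example 7.14, reduces the variance due to the random sampling.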

7.5.7 Supervised Measures of Cluster Validity

When we have external information about data, it is typically in the form of externally derived class labels for the data objects. In such cases, the usual procedure is to measure the degree of correspondence between the cluster labels and the class labels. But why is this of interest? After all, if we have the class labels, then what is the point in performing a cluster analysis? Motivations for such an analysis include the comparison of clustering techniques with the "ground truth" or the evaluation of the extent to which a manual classification process can be automatically produced by cluster analysis, e.g., the clustering of news articles. Another potential motivation could be to evaluate whether objects in the same cluster tend to have the same label for semi-supervised learning techniques.

We consider two different kinds of approaches. The first set of techniques uses measures from classification, such as entropy, purity, and the F-measure. These measures evaluate the extent to which a cluster contains objects of a single class. The second group of methods is related to the similarity measures for binary data, such as the Jaccard measure that we saw in Chapter 2. These approaches measure the extent to which two objects that are in the same class are in the same cluster and vice versa. For convenience, we will refer to these two types of measures as classification-oriented and similarity-oriented, respectively.

Classification-Oriented Measures of Cluster Validity

There are a number of measures that are commonly used to evaluate the performance of a classification model. In this section, we will discuss five: entropy, purity, precision, recall, and the F-measure. In the case of classification, we measure the degree to which predicted class labels correspond to actual class labels, but for the measures just mentioned, nothing fundamental is changed by using cluster labels instead of predicted class labels. Next, we quickly review the definitions of these measures in the context of clustering.

Entropy: The degree to which each cluster consists of objects of a single class. For each cluster, the class distribution of the data is calculated first, i.e., for cluster i we compute p_ij, the probability that a member of cluster i belongs to class j, as p_ij = m_ij/m_i, where m_i is the number of objects in cluster i and m_ij is the number of objects of class j in cluster i. Using this class distribution, the entropy of each cluster i is calculated using the standard formula, e_i = −Σ_{j=1}^{L} p_ij log2 p_ij, where L is the number of classes. The total entropy for a set of clusters is calculated as the sum of the entropies of each cluster weighted by the size of each cluster, i.e., e = Σ_{i=1}^{K} (m_i/m) e_i, where K is the number of clusters and m is the total number of data points.

Purity: Another measure of the extent to which a cluster contains objects of a single class. Using the previous terminology, the purity of cluster i is purity(i) = max_j p_ij, and the overall purity of a clustering is purity = Σ_{i=1}^{K} (m_i/m) purity(i).

Precision: The fraction of a cluster that consists of objects of a specified class. The precision of cluster i with respect to class j is precision(i, j) = p_ij.

Recall: The extent to which a cluster contains all objects of a specified class. The recall of cluster i with respect to class j is recall(i, j) = m_ij/m_j, where m_j is the number of objects in class j.

F-measure: A combination of both precision and recall that measures the extent to which a cluster contains only objects of a particular class and all objects of that class. The F-measure of cluster i with respect to class j is F(i, j) = (2 × precision(i, j) × recall(i, j))/(precision(i, j) + recall(i, j)). These definitions are illustrated computationally in the sketch that follows.
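Here is a small computational sketch of these definitions, assuming NumPy; the function name is ours. It takes a cluster-by-class confusion matrix such as Table 7.9 (rows are clusters, columns are classes) and returns the entropy, purity, precision, recall, and F-measure values defined above.

    import numpy as np

    def cluster_class_measures(conf):
        """conf[i, j] = number of objects of class j in cluster i."""
        conf = np.asarray(conf, dtype=float)
        m_i = conf.sum(axis=1)                     # cluster sizes
        m_j = conf.sum(axis=0)                     # class sizes
        m = conf.sum()
        p = conf / m_i[:, None]                    # p_ij = m_ij / m_i
        p_safe = np.where(p > 0, p, 1.0)           # avoid log2(0); those terms contribute 0
        e_i = -(p * np.log2(p_safe)).sum(axis=1)   # entropy of each cluster
        entropy = np.sum(m_i / m * e_i)            # total (weighted) entropy
        purity = np.sum(m_i / m * p.max(axis=1))   # overall purity
        precision = p                              # precision(i, j) = p_ij
        recall = conf / m_j[None, :]               # recall(i, j) = m_ij / m_j
        F = 2 * precision * recall / (precision + recall + 1e-12)   # avoid 0/0
        return entropy, purity, precision, recall, F

Applied to the six rows of Table 7.9, the entropy and purity values returned by such a routine should match the last two columns of that table.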

The F-measure of a set of clusters, whether partitional or hierarchical, is presented on page 594, when we discuss cluster validity for hierarchical clusterings.

Example 7.15 (Supervised Evaluation Measures). We present an example to illustrate these measures. Specifically, we use K-means with the cosine similarity measure to cluster 3204 newspaper articles from the Los Angeles Times. These articles come from six different classes: Entertainment, Financial, Foreign, Metro, National, and Sports. Table 7.9 shows the results of a K-means clustering to find six clusters. The first column indicates the cluster, while the next six columns together form the confusion matrix; i.e., these columns indicate how the documents of each category are distributed among the clusters. The last two columns are the entropy and purity of each cluster, respectively.

Table7.9.K-meansclusteringresultsfortheLATimesdocumentdataset.

Ideally, each cluster will contain documents from only one class. In reality, each cluster contains documents from many classes. Nevertheless, many clusters contain documents primarily from just one class. In particular, cluster 3, which contains mostly documents from the Sports section, is exceptionally good, both in terms of purity and entropy. The purity and entropy of the other clusters are not as good, but can typically be greatly improved if the data is partitioned into a larger number of clusters.

Cluster Entertainment Financial Foreign Metro National Sports Entropy Purity

1 3 5 40 506 96 27 1.2270 0.7474

2 4 7 280 29 39 2 1.1472 0.7756

3 1 1 1 7 4 671 0.1813 0.9796

4 10 162 3 119 73 2 1.7487 0.4390

5 331 22 5 70 13 23 1.3976 0.7134

6 5 358 12 212 48 13 1.5523 0.5525

Total 354 555 341 943 273 738 1.1450 0.7203

Precision, recall, and the F-measure can be calculated for each cluster. To give a concrete example, we consider cluster 1 and the Metro class of Table 7.9. The precision is 506/677 = 0.75, the recall is 506/943 = 0.26, and hence, the F value is 0.39. In contrast, the F value for cluster 3 and Sports is 0.94. As in classification, the confusion matrix gives the most detailed information.

Similarity-Oriented Measures of Cluster Validity

The measures that we discuss in this section are all based on the premise that any two objects that are in the same cluster should be in the same class and vice versa. We can view this approach to cluster validity as involving the comparison of two matrices: (1) the ideal cluster similarity matrix discussed previously, which has a 1 in the ijth entry if two objects, i and j, are in the same cluster and 0, otherwise, and (2) a class similarity matrix defined with respect to class labels, which has a 1 in the ijth entry if two objects, i and j, belong to the same class, and a 0 otherwise. As before, we can take the correlation of these two matrices as the measure of cluster validity. This measure is known as Hubert's Γ statistic in the clustering validation literature.

Example 7.16 (Correlation between Cluster and Class Matrices). To demonstrate this idea more concretely, we give an example involving five data points, p1, p2, p3, p4, and p5, two clusters, C1 = {p1, p2, p3} and C2 = {p4, p5}, and two classes, L1 = {p1, p2} and L2 = {p3, p4, p5}. The ideal cluster and class similarity matrices are given in Tables 7.10 and 7.11. The correlation between the entries of these two matrices is 0.359.

Table7.10.Idealclustersimilaritymatrix.


Point p1 p2 p3 p4 p5

p1 1 1 1 0 0

p2 1 1 1 0 0

p3 1 1 1 0 0

p4 0 0 0 1 1

p5 0 0 0 1 1

Table7.11.Classsimilaritymatrix.

Point p1 p2 p3 p4 p5

p1 1 1 0 0 0

p2 1 1 0 0 0

p3 0 0 1 1 1

p4 0 0 1 1 1

p5 0 0 1 1 1

More generally, we can use any of the measures for binary similarity that we saw in Section 2.4.5. (For example, we can convert these two matrices into binary vectors by appending the rows.) We repeat the definitions of the four quantities used to define those similarity measures, but modify our descriptive text to fit the current context. Specifically, we need to compute the following four quantities for all pairs of distinct objects. (There are m(m−1)/2 such pairs, if m is the number of objects.)


f00 = number of pairs of objects having a different class and a different cluster
f01 = number of pairs of objects having a different class and the same cluster
f10 = number of pairs of objects having the same class and a different cluster
f11 = number of pairs of objects having the same class and the same cluster

In particular, the simple matching coefficient, which is known as the Rand statistic in this context, and the Jaccard coefficient are two of the most frequently used cluster validity measures:

Rand statistic = (f00 + f11) / (f00 + f01 + f10 + f11)   (7.18)

Jaccard coefficient = f11 / (f01 + f10 + f11)   (7.19)

Example 7.17 (Rand and Jaccard Measures). Based on these formulas, we can readily compute the Rand statistic and Jaccard coefficient for the example based on Tables 7.10 and 7.11. Noting that f00 = 4, f01 = 2, f10 = 2, and f11 = 2, the Rand statistic = (2 + 4)/10 = 0.6 and the Jaccard coefficient = 2/(2 + 2 + 2) = 0.33.
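A minimal sketch of this computation from two label vectors; the function name and the encoding of the five-point example are ours. It counts f00, f01, f10, and f11 over all pairs of objects and applies Equations 7.18 and 7.19.

    from itertools import combinations

    def rand_and_jaccard(cluster_labels, class_labels):
        f = {"00": 0, "01": 0, "10": 0, "11": 0}   # keyed by (same class?, same cluster?)
        for i, j in combinations(range(len(cluster_labels)), 2):
            same_class = int(class_labels[i] == class_labels[j])
            same_cluster = int(cluster_labels[i] == cluster_labels[j])
            f[f"{same_class}{same_cluster}"] += 1
        rand = (f["00"] + f["11"]) / sum(f.values())
        jaccard = f["11"] / (f["01"] + f["10"] + f["11"])
        return rand, jaccard

    # the five points of Example 7.16: clusters C1 = {p1, p2, p3}, C2 = {p4, p5};
    # classes L1 = {p1, p2}, L2 = {p3, p4, p5}
    print(rand_and_jaccard([1, 1, 1, 2, 2], [1, 1, 2, 2, 2]))   # about (0.6, 0.33)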

We also note that the four quantities, f00, f01, f10, and f11, define a contingency table as shown in Table 7.12.

Table 7.12. Two-way contingency table for determining whether pairs of objects are in the same class and same cluster.

                  Same Cluster   Different Cluster

Same Class            f11              f10

Different Class       f01              f00

Previously, in the context of association analysis (see Section 5.7.1 on page 402) we presented an extensive discussion of measures of association that can be used for this type of contingency table. (Compare Table 7.12 on page 593 with Table 5.6 on page 402.) Those measures can also be applied to cluster validity.


Cluster Validity for Hierarchical Clusterings

So far in this section, we have discussed supervised measures of cluster validity only for partitional clusterings. Supervised evaluation of a hierarchical clustering is more difficult for a variety of reasons, including the fact that a preexisting hierarchical structure often does not exist. In addition, although relatively pure clusters often exist at certain levels in the hierarchical clustering, as the clustering proceeds, the clusters will become impure. The key idea of the approach presented here, which is based on the F-measure, is to evaluate whether a hierarchical clustering contains, for each class, at least one cluster that is relatively pure and includes most of the objects of that class. To evaluate a hierarchical clustering with respect to this goal, we compute, for each class, the F-measure for each cluster in the cluster hierarchy, and then take the maximum F-measure attained for any cluster. Finally, we calculate an overall F-measure for the hierarchical clustering by computing the weighted average of all per-class F-measures, where the weights are based on the class sizes. More formally, this hierarchical F-measure is defined as follows:

F = Σ_j (m_j/m) max_i F(i, j)

where the maximum is taken over all clusters i at all levels, m_j is the number of objects in class j, and m is the total number of objects. Note that this measure can also be applied for a partitional clustering without modification.
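A minimal sketch of the hierarchical F-measure; the function name is ours, and it assumes the hierarchy has already been flattened into an explicit list of clusters drawn from all levels, each cluster given as a collection of object indices.

    from collections import Counter

    def hierarchical_f_measure(clusters, class_labels):
        """clusters: clusters from all levels of the hierarchy, each a collection of object indices.
        class_labels: class_labels[i] is the class of object i."""
        m = len(class_labels)
        class_sizes = Counter(class_labels)
        total = 0.0
        for cls, m_j in class_sizes.items():
            best = 0.0
            for cluster in clusters:
                m_ij = sum(1 for i in cluster if class_labels[i] == cls)
                if m_ij == 0:
                    continue
                precision = m_ij / len(cluster)
                recall = m_ij / m_j
                best = max(best, 2 * precision * recall / (precision + recall))
            total += (m_j / m) * best        # weight each class's best F-measure by class size
        return total

For a partitional clustering, the same function can be used by passing the K clusters of the partition as the cluster list.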

7.5.8 Assessing the Significance of Cluster Validity Measures


Cluster validity measures are intended to help us measure the goodness of the clusters that we have obtained. Indeed, they typically give us a single number as a measure of that goodness. However, we are then faced with the problem of interpreting the significance of this number, a task that might be even more difficult.

The minimum and maximum values of cluster evaluation measures can provide some guidance in many cases. For instance, by definition, a purity of 0 is bad, while a purity of 1 is good, at least if we trust our class labels and want our cluster structure to reflect the class structure. Likewise, an entropy of 0 is good, as is an SSE of 0.

Sometimes, however, there is no minimum or maximum value, or the scale of the data might affect the interpretation. Also, even if there are minimum and maximum values with obvious interpretations, intermediate values still need to be interpreted. In some cases, we can use an absolute standard. If, for example, we are clustering for utility, we are often willing to tolerate only a certain level of error in the approximation of our points by a cluster centroid.

But if this is not the case, then we must do something else. A common approach is to interpret the value of our validity measure in statistical terms. Specifically, we attempt to judge how likely it is that our observed value was achieved by random chance. The value is good if it is unusual; i.e., if it is unlikely to be the result of random chance. The motivation for this approach is that we are interested only in clusters that reflect non-random structure in the data, and such structures should generate unusually high (low) values of our cluster validity measure, at least if the validity measures are designed to reflect the presence of strong cluster structure.

Example 7.18 (Significance of SSE). To show how this works, we present an example based on K-means and the SSE. Suppose that we want a measure of how good the well-separated clusters of Figure 7.30 are with respect to random data. We generate many random sets of 100 points having the same range as the points in the three clusters, find three clusters in each data set using K-means, and accumulate the distribution of SSE values for these clusterings. By using this distribution of the SSE values, we can then estimate the probability of the SSE value for the original clusters. Figure 7.34 shows the histogram of the SSE from 500 random runs. The lowest SSE shown in Figure 7.34 is 0.0173. For the three clusters of Figure 7.30, the SSE is 0.0050. We could therefore conservatively claim that there is less than a 1% chance that a clustering such as that of Figure 7.30 could occur by chance.
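A minimal sketch of the randomization procedure of Example 7.18, assuming scikit-learn and NumPy; the function name, the number of trials, and the use of KMeans.inertia_ as the SSE are our own choices.

    import numpy as np
    from sklearn.cluster import KMeans

    def sse_significance(X, k, n_trials=500, seed=0):
        """Empirical probability that random data over the same range yields an SSE
        at least as small as that of the K-means clustering of X."""
        rng = np.random.default_rng(seed)
        observed = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        lo, hi = X.min(axis=0), X.max(axis=0)
        random_sse = []
        for _ in range(n_trials):
            R = rng.uniform(lo, hi, size=X.shape)   # random data with the same range as X
            random_sse.append(KMeans(n_clusters=k, n_init=10).fit(R).inertia_)
        p_value = np.mean(np.array(random_sse) <= observed)
        return observed, p_value

A small p-value here plays the role of the "less than 1% chance" statement in the example: the observed SSE is unusually low relative to what random data produces.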

Figure7.34.HistogramofSSEfor500randomdatasets.

In the previous example, randomization was used to evaluate the statistical significance of a cluster validity measure. However, for some measures, such as Hubert's Γ statistic, the distribution is known and can be used to evaluate the measure. In addition, a normalized version of the measure can be computed by subtracting the mean and dividing by the standard deviation. See the Bibliographic Notes for more details.

We stress that there is more to cluster evaluation (unsupervised or supervised) than obtaining a numerical measure of cluster validity. Unless this value has a natural interpretation based on the definition of the measure, we need to interpret this value in some way. If our cluster evaluation measure is defined such that lower (higher) values indicate stronger clusters, then we can use statistics to evaluate whether the value we have obtained is unusually low (high), provided we have a distribution for the evaluation measure. We have presented an example of how to find such a distribution, but there is considerably more to this topic, and we refer the reader to the Bibliographic Notes for more pointers.

Finally, even when an evaluation measure is used as a relative measure, i.e., to compare two clusterings, we still need to assess the significance of the difference between the evaluation measures of the two clusterings. Although one value will almost always be better than another, it can be difficult to determine if the difference is significant. Note that there are two aspects to this significance: whether the difference is statistically significant (repeatable) and whether the magnitude of the difference is meaningful with respect to the application. Many would not regard a difference of 0.1% as significant, even if it is consistently reproducible.

7.5.9 Choosing a Cluster Validity Measure

There are many more measures for evaluating cluster validity than have been discussed here. Various books and articles propose various measures as being superior to others. In this section, we offer some high-level guidance. First, it is important to distinguish whether the clustering is being performed for summarization or understanding. If summarization, typically class labels are not involved and the goal is maximum compression. This is often done by finding clusters that minimize the distance of objects to their closest cluster centroid. Indeed, the clustering process is often driven by the objective of minimizing representation error. Measures such as SSE are more natural for this application.

If the clustering is being performed for understanding, then the situation is more complicated. For the unsupervised case, virtually all measures try to maximize cohesion and separation. Some measures obtain a "best" value for a particular value of K, the number of clusters. Although this might seem appealing, such measures typically only identify one "right" number of clusters, even when subclusters are present. (Recall that cohesion and separation continue to increase for K-means until there is one cluster per point.) More generally, if the number of clusters is not too large, it can be useful to manually examine the cluster cohesion of each cluster and the pairwise separation of clusters. However, note that very few cluster validity measures are appropriate for contiguity- or density-based clusters that can have irregular and intertwined shapes.

For the supervised case, where clustering is almost always performed with a goal of generating understandable clusters, the ideal result of clustering is to produce clusters that match the underlying class structure. Evaluating the match between a set of clusters and classes is a non-trivial problem. The F-measure and hierarchical F-measure discussed earlier are examples of how to evaluate such a match. Other examples can be found in the references to cluster evaluation provided in the Bibliographic Notes. In any case, when the number of clusters is relatively small, the confusion matrix can be more informative than any single measure of cluster validity, since it can indicate which classes tend to appear in clusters with which other classes. Note that supervised cluster evaluation indices are independent of whether the clusters are center-, contiguity-, or density-based.

In conclusion, it is important to realize that clustering is often used as an exploratory data technique whose goal is often not to provide a crisp answer, but rather to provide some insight into the underlying structure of the data. In this situation, cluster validity indices are useful to the extent that they further that end goal.

7.6 Bibliographic Notes

Discussion in this chapter has been most heavily influenced by the books on cluster analysis written by Jain and Dubes [536], Anderberg [509], and Kaufman and Rousseeuw [540], as well as the more recent book edited by Aggarwal and Reddy [507]. Additional clustering books that may also be of interest include those by Aldenderfer and Blashfield [508], Everitt et al. [527], Hartigan [533], Mirkin [548], Murtagh [550], Romesburg [553], and Späth [557]. A more statistically oriented approach to clustering is given by the pattern recognition book of Duda et al. [524], the machine learning book of Mitchell [549], and the book on statistical learning by Hastie et al. [534]. General surveys of clustering are given by Jain et al. [537] and Xu and Wunsch [560], while a survey of spatial data mining techniques is provided by Han et al. [532]. Berkhin [515] provides a survey of clustering techniques for data mining. A good source of references to clustering outside of the data mining field is the article by Arabie and Hubert [511]. A paper by Kleinberg [541] provides a discussion of some of the trade-offs that clustering algorithms make and proves that it is impossible for a clustering algorithm to simultaneously possess three simple properties. A wide-ranging, retrospective article by Jain provides a look at clustering during the 50 years from the invention of K-means [535].

TheK-meansalgorithmhasalonghistory,butisstillthesubjectofcurrentresearch.TheK-meansalgorithmwasnamedbyMacQueen[545],althoughitshistoryismoreextensive.BockexaminestheoriginsofK-meansandsomeofitsextensions[516].TheISODATAalgorithmbyBallandHall[513]wasanearly,butsophisticatedversionofK-meansthatemployedvariouspre-andpostprocessingtechniquestoimproveonthebasicalgorithm.TheK-meansalgorithmandmanyofitsvariationsaredescribedindetailinthebooksby

Anderberg[509]andJainandDubes[536].ThebisectingK-meansalgorithmdiscussedinthischapterwasdescribedinapaperbySteinbachetal.[558],andanimplementationofthisandotherclusteringapproachesisfreelyavailableforacademicuseintheCLUTO(CLUsteringTOolkit)packagecreatedbyKarypis[520].Boley[517]hascreatedadivisivepartitioningclusteringalgorithm(PDDP)basedonfindingthefirstprincipaldirection(component)ofthedata,andSavaresiandBoley[555]haveexploreditsrelationshiptobisectingK-means.RecentvariationsofK-meansareanewincrementalversionofK-means(Dhillonetal.[522]),X-means(PellegandMoore[552]),andK-harmonicmeans(Zhangetal[562]).HamerlyandElkan[531]discusssomeclusteringalgorithmsthatproducebetterresultsthanK-means.WhilesomeofthepreviouslymentionedapproachesaddresstheinitializationproblemofK-meansinsomemanner,otherapproachestoimprovingK-meansinitializationcanalsobefoundintheworkofBradleyandFayyad[518].TheK-means++initializationapproachwasproposedbyArthurandVassilvitskii[512].DhillonandModha[523]presentageneralizationofK-means,calledsphericalK-means,whichworkswithcommonlyusedsimilarityfunctions.AgeneralframeworkforK-meansclusteringthatusesdissimilarityfunctionsbasedonBregmandivergenceswasconstructedbyBanerjeeetal.[514].

Hierarchicalclusteringtechniquesalsohavealonghistory.MuchoftheinitialactivitywasintheareaoftaxonomyandiscoveredinbooksbyJardineandSibson[538]andSneathandSokal[556].General-purposediscussionsofhierarchicalclusteringarealsoavailableinmostoftheclusteringbooksmentionedabove.Agglomerativehierarchicalclusteringisthefocusofmostworkintheareaofhierarchicalclustering,butdivisiveapproacheshavealsoreceivedsomeattention.Forexample,Zahn[561]describesadivisivehierarchicaltechniquethatusestheminimumspanningtreeofagraph.Whilebothdivisiveandagglomerativeapproachestypicallytaketheviewthatmerging(splitting)decisionsarefinal,therehasbeensomeworkbyFisher

[528]andKarypisetal.[539]toovercometheselimitations.MurtaghandContrerasprovidearecentoverviewofhierarchicalclusteringalgorithms[551]andhavealsoproposedalineartimehierarchicalclusteringalgorithm[521].

Esteretal.proposedDBSCAN[526],whichwaslatergeneralizedtotheGDBSCANalgorithmbySanderetal.[554]inordertohandlemoregeneraltypesofdataanddistancemeasures,suchaspolygonswhoseclosenessismeasuredbythedegreeofintersection.AnincrementalversionofDBSCANwasdevelopedbyKriegeletal.[525].OneinterestingoutgrowthofDBSCANisOPTICS(OrderingPointsToIdentifytheClusteringStructure)(Ankerstetal.[510]),whichallowsthevisualizationofclusterstructureandcanalsobeusedforhierarchicalclustering.Arecentdiscussionofdensity-basedclusteringbyKriegeletal.[542]providesaveryreadablesynopsisofdensity-basedclusteringandrecentdevelopments.

Anauthoritativediscussionofclustervalidity,whichstronglyinfluencedthediscussioninthischapter,isprovidedinChapter4 ofJainandDubes’clusteringbook[536].ArecentreviewofclustervaliditymeasuresbyXiongandLicanbefoundin[559].OtherrecentreviewsofclustervalidityarethoseofHalkidietal.[529,530]andMilligan[547].SilhouettecoefficientsaredescribedinKaufmanandRousseeuw’sclusteringbook[540].ThesourceofthecohesionandseparationmeasuresinTable7.6 isapaperbyZhaoandKarypis[563],whichalsocontainsadiscussionofentropy,purity,andthehierarchicalF-measure.TheoriginalsourceofthehierarchicalF-measureisanarticlebyLarsenandAone[543].TheCVNNmeasurewasintroducedbyLietal.[544].Anaxiomaticapproachtoclusteringvalidityispresentedin[546].ManyofthepopularindicesforclustervalidationareimplementedintheNbClustRpackage,whichisdescribedinthearticlebyCharradetal.[519].

Bibliography[507]C.C.AggarwalandC.K.Reddy,editors.DataClustering:Algorithms

andApplications.Chapman&Hall/CRC,1stedition,2013.

[508]M.S.AldenderferandR.K.Blashfield.ClusterAnalysis.SagePublications,LosAngeles,1985.

[509]M.R.Anderberg.ClusterAnalysisforApplications.AcademicPress,NewYork,December1973.

[510]M.Ankerst,M.M.Breunig,H.-P.Kriegel,andJ.Sander.OPTICS:OrderingPointsToIdentifytheClusteringStructure.InProc.of1999ACM-SIGMODIntl.Conf.onManagementofData,pages49–60,Philadelphia,Pennsylvania,June1999.ACMPress.

[511]P.Arabie,L.Hubert,andG.D.Soete.Anoverviewofcombinatorialdataanalysis.InP.Arabie,L.Hubert,andG.D.Soete,editors,ClusteringandClassification,pages188–217.WorldScientific,Singapore,January1996.

[512]D.ArthurandS.Vassilvitskii.k-means++:Theadvantagesofcarefulseeding.InProceedingsoftheeighteenthannualACM-SIAMsymposiumonDiscretealgorithms,pages1027–1035.SocietyforIndustrialandAppliedMathematics,2007.

[513]G.BallandD.Hall.AClusteringTechniqueforSummarizingMultivariateData.BehaviorScience,12:153–155,March1967.

[514]A.Banerjee,S.Merugu,I.S.Dhillon,andJ.Ghosh.ClusteringwithBregmanDivergences.InProc.ofthe2004SIAMIntl.Conf.onDataMining,pages234–245,LakeBuenaVista,FL,April2004.

[515]P.Berkhin.SurveyOfClusteringDataMiningTechniques.Technicalreport,AccrueSoftware,SanJose,CA,2002.

[516]H.-H.Bock.Originsandextensionsofthe-meansalgorithminclusteranalysis.JournalÉlectroniqued’HistoiredesProbabilitésetdelaStatistique[electroniconly],4(2):Article–14,2008.

[517]D.Boley.PrincipalDirectionDivisivePartitioning.DataMiningandKnowledgeDiscovery,2(4):325–344,1998.

[518]P.S.BradleyandU.M.Fayyad.RefiningInitialPointsforK-MeansClustering.InProc.ofthe15thIntl.Conf.onMachineLearning,pages91–99,Madison,WI,July1998.MorganKaufmannPublishersInc.

[519]M.Charrad,N.Ghazzali,V.Boiteau,andA.Niknafs.NbClust:anRpackagefordeterminingtherelevantnumberofclustersinadataset.JournalofStatisticalSoftware,61(6):1–36,2014.

[520]CLUTO2.1.2:SoftwareforClusteringHigh-DimensionalDatasets.www.cs.umn.edu/∼karypis,October2016.

[521]P.ContrerasandF.Murtagh.Fast,lineartimehierarchicalclusteringusingtheBairemetric.Journalofclassification,29(2):118–143,2012.

[522]I.S.Dhillon,Y.Guan,andJ.Kogan.IterativeClusteringofHighDimensionalTextDataAugmentedbyLocalSearch.InProc.ofthe2002IEEEIntl.Conf.onDataMining,pages131–138.IEEEComputerSociety,2002.

[523]I.S.DhillonandD.S.Modha.ConceptDecompositionsforLargeSparseTextDataUsingClustering.MachineLearning,42(1/2):143–175,2001.

[524]R.O.Duda,P.E.Hart,andD.G.Stork.PatternClassification.JohnWiley&Sons,Inc.,NewYork,secondedition,2001.

[525]M.Ester,H.-P.Kriegel,J.Sander,M.Wimmer,andX.Xu.IncrementalClusteringforMininginaDataWarehousingEnvironment.InProc.ofthe24thVLDBConf.,pages323–333,NewYorkCity,August1998.MorganKaufmann.

[526]M.Ester,H.-P.Kriegel,J.Sander,andX.Xu.ADensity-BasedAlgorithmforDiscoveringClustersinLargeSpatialDatabaseswithNoise.InProc.ofthe2ndIntl.Conf.onKnowledgeDiscoveryandDataMining,pages226–231,Portland,Oregon,August1996.AAAIPress.

[527]B.S.Everitt,S.Landau,andM.Leese.ClusterAnalysis.ArnoldPublishers,London,4thedition,May2001.

[528]D.Fisher.IterativeOptimizationandSimplificationofHierarchicalClusterings.JournalofArtificialIntelligenceResearch,4:147–179,1996.

[529]M.Halkidi,Y.Batistakis,andM.Vazirgiannis.Clustervaliditymethods:partI.SIGMODRecord(ACMSpecialInterestGrouponManagementofData),31(2):40–45,June2002.

[530]M.Halkidi,Y.Batistakis,andM.Vazirgiannis.Clusteringvaliditycheckingmethods:partII.SIGMODRecord(ACMSpecialInterestGrouponManagementofData),31(3):19–27,Sept.2002.

[531]G.HamerlyandC.Elkan.Alternativestothek-meansalgorithmthatfindbetterclusterings.InProc.ofthe11thIntl.Conf.onInformationandKnowledgeManagement,pages600–607,McLean,Virginia,2002.ACMPress.

[532]J.Han,M.Kamber,andA.Tung.SpatialClusteringMethodsinDataMining:Areview.InH.J.MillerandJ.Han,editors,GeographicDataMiningandKnowledgeDiscovery,pages188–217.TaylorandFrancis,London,December2001.

[533]J.Hartigan.ClusteringAlgorithms.Wiley,NewYork,1975.

[534]T.Hastie,R.Tibshirani,andJ.H.Friedman.TheElementsofStatisticalLearning:DataMining,Inference,Prediction.Springer,NewYork,2001.

[535]A.K.Jain.Dataclustering:50yearsbeyondK-means.Patternrecognitionletters,31(8):651–666,2010.

[536]A.K.JainandR.C.Dubes.AlgorithmsforClusteringData.PrenticeHallAdvancedReferenceSeries.PrenticeHall,March1988.

[537]A.K.Jain,M.N.Murty,andP.J.Flynn.Dataclustering:Areview.ACMComputingSurveys,31(3):264–323,September1999.

[538]N.JardineandR.Sibson.MathematicalTaxonomy.Wiley,NewYork,1971.

[539]G.Karypis,E.-H.Han,andV.Kumar.MultilevelRefinementforHierarchicalClustering.TechnicalReportTR99-020,UniversityofMinnesota,Minneapolis,MN,1999.

[540]L.KaufmanandP.J.Rousseeuw.FindingGroupsinData:AnIntroductiontoClusterAnalysis.WileySeriesinProbabilityandStatistics.JohnWileyandSons,NewYork,November1990.

[541]J.M.Kleinberg.AnImpossibilityTheoremforClustering.InProc.ofthe16thAnnualConf.onNeuralInformationProcessingSystems,December,9–142002.

[542]H.-P.Kriegel,P.Kröger,J.Sander,andA.Zimek.Density-basedclustering.WileyInterdisciplinaryReviews:DataMiningandKnowledgeDiscovery,1(3):231–240,2011.

[543]B.LarsenandC.Aone.FastandEffectiveTextMiningUsingLinear-TimeDocumentClustering.InProc.ofthe5thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages16–22,SanDiego,California,1999.ACMPress.

[544]Y.Liu,Z.Li,H.Xiong,X.Gao,J.Wu,andS.Wu.Understandingandenhancementofinternalclusteringvalidationmeasures.Cybernetics,IEEETransactionson,43(3):982–994,2013.

[545]J.MacQueen.Somemethodsforclassificationandanalysisofmultivariateobservations.InProc.ofthe5thBerkeleySymp.onMathematicalStatisticsandProbability,pages281–297.UniversityofCaliforniaPress,1967.

[546]M.Meilă.ComparingClusterings:AnAxiomaticView.InProceedingsofthe22NdInternationalConferenceonMachineLearning,ICML’05,pages577–584,NewYork,NY,USA,2005.ACM.

[547]G.W.Milligan.ClusteringValidation:ResultsandImplicationsforAppliedAnalyses.InP.Arabie,L.Hubert,andG.D.Soete,editors,ClusteringandClassification,pages345–375.WorldScientific,Singapore,January1996.

[548]B.Mirkin.MathematicalClassificationandClustering,volume11ofNonconvexOptimizationandItsApplications.KluwerAcademicPublishers,August1996.

[549]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.

[550]F.Murtagh.MultidimensionalClusteringAlgorithms.Physica-Verlag,HeidelbergandVienna,1985.

[551]F.MurtaghandP.Contreras.Algorithmsforhierarchicalclustering:anoverview.WileyInterdisciplinaryReviews:DataMiningandKnowledgeDiscovery,2(1):86–97,2012.

[552]D.PellegandA.W.Moore.X-means:ExtendingK-meanswithEfficientEstimationoftheNumberofClusters.InProc.ofthe17thIntl.Conf.onMachineLearning,pages727–734.MorganKaufmann,SanFrancisco,CA,2000.

[553]C.Romesburg.ClusterAnalysisforResearchers.LifeTimeLearning,Belmont,CA,1984.

[554]J.Sander,M.Ester,H.-P.Kriegel,andX.Xu.Density-BasedClusteringinSpatialDatabases:TheAlgorithmGDBSCANanditsApplications.DataMiningandKnowledgeDiscovery,2(2):169–194,1998.

[555]S.M.SavaresiandD.Boley.AcomparativeanalysisonthebisectingK-meansandthePDDPclusteringalgorithms.IntelligentDataAnalysis,8(4):345–362,2004.

[556]P.H.A.SneathandR.R.Sokal.NumericalTaxonomy.Freeman,SanFrancisco,1971.

[557]H.Späth.ClusterAnalysisAlgorithmsforDataReductionandClassificationofObjects,volume4ofComputersandTheirApplication.EllisHorwoodPublishers,Chichester,1980.ISBN0-85312-141-9.

[558]M.Steinbach,G.Karypis,andV.Kumar.AComparisonofDocumentClusteringTechniques.InProc.ofKDDWorkshoponTextMining,Proc.ofthe6thIntl.Conf.onKnowledgeDiscoveryandDataMining,Boston,MA,August2000.

[559]H.XiongandZ.Li.ClusteringValidationMeasures.InC.C.AggarwalandC.K.Reddy,editors,DataClustering:AlgorithmsandApplications,pages571–605.Chapman&Hall/CRC,2013.

[560]R.Xu,D.Wunsch,etal.Surveyofclusteringalgorithms.NeuralNetworks,IEEETransactionson,16(3):645–678,2005.

[561]C.T.Zahn.Graph-TheoreticalMethodsforDetectingandDescribingGestaltClusters.IEEETransactionsonComputers,C-20(1):68–86,Jan.1971.

[562]B.Zhang,M.Hsu,andU.Dayal.K-HarmonicMeans—ADataClusteringAlgorithm.TechnicalReportHPL-1999-124,HewlettPackardLaboratories,Oct.291999.

[563]Y.ZhaoandG.Karypis.Empiricalandtheoreticalcomparisonsofselectedcriterionfunctionsfordocumentclustering.MachineLearning,55(3):311–331,2004.

7.7 Exercises

1. Consider a data set consisting of 2^20 data vectors, where each vector has 32 components and each component is a 4-byte value. Suppose that vector quantization is used for compression, and that 2^16 prototype vectors are used. How many bytes of storage does that data set take before and after compression and what is the compression ratio?

2.Findallwell-separatedclustersinthesetofpointsshowninFigure7.35 .

Figure7.35.PointsforExercise2 .

3.Manypartitionalclusteringalgorithmsthatautomaticallydeterminethenumberofclustersclaimthatthisisanadvantage.Listtwosituationsinwhichthisisnotthecase.

4. Given K equally sized clusters, the probability that a randomly chosen initial centroid will come from any given cluster is 1/K, but the probability that each cluster will have exactly one initial centroid is much lower. (It should be clear that having one initial centroid in each cluster is a good starting situation for K-means.) In general, if there are K clusters and each cluster has n points, then the probability, p, of selecting in a sample of size K one initial centroid from each cluster is given by Equation 7.20. (This assumes sampling with replacement.)

p = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids) = K!n^K / (Kn)^K   (7.20)

From this formula we can calculate, for example, that the chance of having one initial centroid from each of four clusters is 4!/4^4 = 0.0938.

a. PlottheprobabilityofobtainingonepointfromeachclusterinasampleofsizeKforvaluesofKbetween2and100.

b. For K clusters, K = 10, 100, and 1000, find the probability that a sample of size 2K contains at least one point from each cluster. You can use either mathematical methods or statistical simulation to determine the answer.

5.IdentifytheclustersinFigure7.36 usingthecenter-,contiguity-,anddensity-baseddefinitions.Alsoindicatethenumberofclustersforeachcaseandgiveabriefindicationofyourreasoning.Notethatdarknessorthenumberofdotsindicatesdensity.Ifithelps,assumecenter-basedmeansK-means,contiguity-basedmeanssinglelink,anddensity-basedmeansDBSCAN.

Figure7.36.ClustersforExercise5 .

6.Forthefollowingsetsoftwo-dimensionalpoints,(1)provideasketchofhowtheywouldbesplitintoclustersbyK-meansforthegivennumberof


clustersand(2)indicateapproximatelywheretheresultingcentroidswouldbe.Assumethatweareusingthesquarederrorobjectivefunction.Ifyouthinkthatthereismorethanonepossiblesolution,thenpleaseindicatewhethereachsolutionisaglobalorlocalminimum.NotethatthelabelofeachdiagraminFigure7.37 matchesthecorrespondingpartofthisquestion,e.g.,Figure7.37(a) goeswithpart(a).

Figure7.37.DiagramsforExercise6 .

a. K = 2. Assuming that the points are uniformly distributed in the circle, how many possible ways are there (in theory) to partition the points into two clusters? What can you say about the positions of the two centroids? (Again, you don't need to provide exact centroid locations, just a qualitative description.)

b. K = 3. The distance between the edges of the circles is slightly greater than the radii of the circles.

c. K = 3. The distance between the edges of the circles is much less than the radii of the circles.

d. K = 2.

e. K = 3. Hint: Use the symmetry of the situation and remember that we are looking for a rough sketch of what the result would be.

7.Supposethatforadataset

therearempointsandKclusters,

halfthepointsandclustersarein“moredense”regions,

halfthepointsandclustersarein“lessdense”regions,and

thetworegionsarewell-separatedfromeachother.

Forthegivendataset,whichofthefollowingshouldoccurinordertominimizethesquarederrorwhenfindingKclusters:

a. Centroidsshouldbeequallydistributedbetweenmoredenseandlessdenseregions.

b. Morecentroidsshouldbeallocatedtothelessdenseregion.

c. Morecentroidsshouldbeallocatedtothedenserregion.

Note:Donotgetdistractedbyspecialcasesorbringinfactorsotherthandensity.However,ifyoufeelthetrueanswerisdifferentfromanygivenabove,justifyyourresponse.

8.Considerthemeanofaclusterofobjectsfromabinarytransactiondataset.Whataretheminimumandmaximumvaluesofthecomponentsofthemean?Whatistheinterpretationofcomponentsoftheclustermean?Whichcomponentsmostaccuratelycharacterizetheobjectsinthecluster?

9.Giveanexampleofadatasetconsistingofthreenaturalclusters,forwhich(almostalways)K-meanswouldlikelyfindthecorrectclusters,butbisectingK-meanswouldnot.

10.WouldthecosinemeasurebetheappropriatesimilaritymeasuretousewithK-meansclusteringfortimeseriesdata?Whyorwhynot?Ifnot,whatsimilaritymeasurewouldbemoreappropriate?

11.TotalSSEisthesumoftheSSEforeachseparateattribute.WhatdoesitmeaniftheSSEforonevariableislowforallclusters?Lowforjustonecluster?Highforallclusters?Highforjustonecluster?HowcouldyouusethepervariableSSEinformationtoimproveyourclustering?

12.Theleaderalgorithm(Hartigan[533])representseachclusterusingapoint,knownasaleader,andassignseachpointtotheclustercorrespondingtotheclosestleader,unlessthisdistanceisaboveauser-specifiedthreshold.Inthatcase,thepointbecomestheleaderofanewcluster.

a. WhataretheadvantagesanddisadvantagesoftheleaderalgorithmascomparedtoK-means?

b. Suggestwaysinwhichtheleaderalgorithmmightbeimproved.

13.TheVoronoidiagramforasetofKpointsintheplaneisapartitionofallthepointsoftheplaneintoKregions,suchthateverypoint(oftheplane)isassignedtotheclosestpointamongtheKspecifiedpoints—seeFigure7.38 .WhatistherelationshipbetweenVoronoidiagramsandK-meansclusters?WhatdoVoronoidiagramstellusaboutthepossibleshapesofK-meansclusters?

Figure7.38.VoronoidiagramforExercise13 .

14. You are given a data set with 100 records and are asked to cluster the data. You use K-means to cluster the data, but for all values of K, 1 ≤ K ≤ 100, the K-means algorithm returns only one non-empty cluster. You then apply an incremental version of K-means, but obtain exactly the same result. How is this possible? How would single link or DBSCAN handle such data?

15.Traditionalagglomerativehierarchicalclusteringroutinesmergetwoclustersateachstep.Doesitseemlikelythatsuchanapproachaccuratelycapturesthe(nested)clusterstructureofasetofdatapoints?Ifnot,explainhowyoumightpostprocessthedatatoobtainamoreaccurateviewoftheclusterstructure.

16.UsethesimilaritymatrixinTable7.13 toperformsingleandcompletelinkhierarchicalclustering.Showyourresultsbydrawingadendrogram.Thedendrogramshouldclearlyshowtheorderinwhichthepointsaremerged.

Table7.13.SimilaritymatrixforExercise16 .

p1 p2 p3 p4 p5

p1 1.00 0.10 0.41 0.55 0.35

p2 0.10 1.00 0.64 0.47 0.98

p3 0.41 0.64 1.00 0.44 0.85

p4 0.55 0.47 0.44 1.00 0.76

p5 0.35 0.98 0.85 0.76 1.00

17. Hierarchical clustering is sometimes used to generate K clusters, K > 1, by taking the clusters at the Kth level of the dendrogram. (Root is at level 1.) By looking at the clusters produced in this way, we can evaluate the behavior of hierarchical clustering on different types of data and clusters, and also compare hierarchical approaches to K-means.


Thefollowingisasetofone-dimensionalpoints:{6,12,18,24,30,42,48}.

a. Foreachofthefollowingsetsofinitialcentroids,createtwoclustersbyassigningeachpointtothenearestcentroid,andthencalculatethetotalsquarederrorforeachsetoftwoclusters.Showboththeclustersandthetotalsquarederrorforeachsetofcentroids.

i. {18,45}

ii. {15,40}

b. Dobothsetsofcentroidsrepresentstablesolutions;i.e.,iftheK-meansalgorithmwasrunonthissetofpointsusingthegivencentroidsasthestartingcentroids,wouldtherebeanychangeintheclustersgenerated?

c. Whatarethetwoclustersproducedbysinglelink?

d. Whichtechnique,K-meansorsinglelink,seemstoproducethe“mostnatural”clusteringinthissituation?(ForK-means,taketheclusteringwiththelowestsquarederror.)

e. Whatdefinition(s)ofclusteringdoesthisnaturalclusteringcorrespondto?(Well-separated,center-based,contiguous,ordensity.)

f. Whatwell-knowncharacteristicoftheK-meansalgorithmexplainsthepreviousbehavior?

18.SupposewefindKclustersusingWard’smethod,bisectingK-means,andordinaryK-means.Whichofthesesolutionsrepresentsalocalorglobalminimum?Explain.

19. Hierarchical clustering algorithms require O(m^2 log(m)) time, and consequently, are impractical to use directly on larger data sets. One possible technique for reducing the time required is to sample the data set. For example, if K clusters are desired and √m points are sampled from the m


points, then a hierarchical clustering algorithm will produce a hierarchical clustering in roughly O(m) time. K clusters can be extracted from this hierarchical clustering by taking the clusters on the Kth level of the dendrogram. The remaining points can then be assigned to a cluster in linear time, by using various strategies. To give a specific example, the centroids of the K clusters can be computed, and then each of the m − √m remaining points can be assigned to the cluster associated with the closest centroid.

Foreachofthefollowingtypesofdataorclusters,discussbrieflyif(1)samplingwillcauseproblemsforthisapproachand(2)whatthoseproblemsare.Assumethatthesamplingtechniquerandomlychoosespointsfromthetotalsetofmpointsandthatanyunmentionedcharacteristicsofthedataorclustersareasoptimalaspossible.Inotherwords,focusonlyonproblemscausedbytheparticularcharacteristicmentioned.Finally,assumethatKisverymuchlessthanm.

a. Datawithverydifferentsizedclusters.

b. High-dimensionaldata.

c. Datawithoutliers,i.e.,atypicalpoints.

d. Datawithhighlyirregularregions.

e. Datawithglobularclusters.

f. Datawithwidelydifferentdensities.

g. Datawithasmallpercentageofnoisepoints.

h. Non-Euclideandata.

i. Euclideandata.

j. Datawithmanyandmixedattributetypes.


20.ConsiderthefollowingfourfacesshowninFigure7.39 .Again,darknessornumberofdotsrepresentsdensity.Linesareusedonlytodistinguishregionsanddonotrepresentpoints.

Figure7.39.FigureforExercise20 .

a. Foreachfigure,couldyouusesinglelinktofindthepatternsrepresentedbythenose,eyes,andmouth?Explain.

b. Foreachfigure,couldyouuseK-meanstofindthepatternsrepresentedbythenose,eyes,andmouth?Explain.

c. WhatlimitationdoesclusteringhaveindetectingallthepatternsformedbythepointsinFigure7.39(c) ?

21.ComputetheentropyandpurityfortheconfusionmatrixinTable7.14 .

Table7.14.ConfusionmatrixforExercise21 .

Cluster Entertainment Financial Foreign Metro National Sports Total

#1 1 1 0 11 4 676 693

#2 27 89 333 827 253 33 1562

#3 326 465 8 105 16 29 949

Total 354 555 341 943 273 738 3204

22.Youaregiventwosetsof100pointsthatfallwithintheunitsquare.Onesetofpointsisarrangedsothatthepointsareuniformlyspaced.Theothersetofpointsisgeneratedfromauniformdistributionovertheunitsquare.

a. Isthereadifferencebetweenthetwosetsofpoints?

b. If so, which set of points will typically have a smaller SSE for K = 10 clusters?

c. WhatwillbethebehaviorofDBSCANontheuniformdataset?Therandomdataset?

23.UsingthedatainExercise24 ,computethesilhouettecoefficientforeachpoint,eachofthetwoclusters,andtheoverallclustering.

24. Given the set of cluster labels and similarity matrix shown in Tables 7.15 and 7.16, respectively, compute the correlation between the similarity matrix and the ideal similarity matrix, i.e., the matrix whose ijth entry is 1 if two objects belong to the same cluster, and 0 otherwise.

Table7.15.TableofclusterlabelsforExercise24 .

Point ClusterLabel

P1 1

P2 1

P3 2

P4 2


Table7.16.SimilaritymatrixforExercise24 .

Point P1 P2 P3 P4

P1 1 0.8 0.65 0.55

P2 0.8 1 0.7 0.6

P3 0.65 0.7 1 0.9

P4 0.55 0.6 0.9 1

25.ComputethehierarchicalF-measurefortheeightobjects{p1,p2,p3,p4,p5,p6,p7,andp8}andhierarchicalclusteringshowninFigure7.40 .ClassAcontainspointsp1,p2,andp3,whilep4,p5,p6,p7,andp8belongtoclassB.

Figure7.40.HierarchicalclusteringforExercise25 .

26.ComputethecopheneticcorrelationcoefficientforthehierarchicalclusteringsinExercise16 .(Youwillneedtoconvertthesimilaritiesintodissimilarities.)

27.ProveEquation7.14 .

28.ProveEquation7.16 .

29. Prove that Σ_{i=1}^{K} Σ_{x∈C_i} (x − m_i)(m − m_i) = 0. This fact was used in the proof that TSS = SSE + SSB in Section 7.5.2.

30.Clustersofdocumentscanbesummarizedbyfindingthetopterms(words)forthedocumentsinthecluster,e.g.,bytakingthemostfrequentkterms,wherekisaconstant,say10,orbytakingalltermsthatoccurmorefrequentlythanaspecifiedthreshold.SupposethatK-meansisusedtofindclustersofbothdocumentsandwordsforadocumentdataset.

a. HowmightasetoftermclustersdefinedbythetoptermsinadocumentclusterdifferfromthewordclustersfoundbyclusteringthetermswithK-means?

b. Howcouldtermclusteringbeusedtodefineclustersofdocuments?

31.Wecanrepresentadatasetasacollectionofobjectnodesandacollectionofattributenodes,wherethereisalinkbetweeneachobjectandeachattribute,andwheretheweightofthatlinkisthevalueoftheobjectforthatattribute.Forsparsedata,ifthevalueis0,thelinkisomitted.Bipartiteclusteringattemptstopartitionthisgraphintodisjointclusters,whereeachclusterconsistsofasetofobjectnodesandasetofattributenodes.Theobjectiveistomaximizetheweightoflinksbetweentheobjectandattributenodesofacluster,whileminimizingtheweightoflinksbetweenobjectandattributelinksindifferentclusters.Thistypeofclusteringisalsoknownasco-clusteringbecausetheobjectsandattributesareclusteredatthesametime.

a. Howisbipartiteclustering(co-clustering)differentfromclusteringthesetsofobjectsandattributesseparately?

b. Arethereanycasesinwhichtheseapproachesyieldthesameclusters?

c. Whatarethestrengthsandweaknessesofco-clusteringascomparedtoordinaryclustering?


32.InFigure7.41 ,matchthesimilaritymatrices,whicharesortedaccordingtoclusterlabels,withthesetsofpoints.Differencesinshadingandmarkershapedistinguishbetweenclusters,andeachsetofpointscontains100pointsandthreeclusters.Inthesetofpointslabeled2,therearethreeverytight,equalsizedclusters.

Figure7.41.PointsandsimilaritymatricesforExercise32 .

8ClusterAnalysis:AdditionalIssuesandAlgorithms

Alargenumberofclusteringalgorithmshavebeendevelopedinavarietyofdomainsfordifferenttypesofapplications.Noalgorithmissuitableforalltypesofdata,clusters,andapplications.Infact,itseemsthatthereisalwaysroomforanewclusteringalgorithmthatismoreefficientorbettersuitedtoaparticulartypeofdata,cluster,orapplication.Instead,wecanonlyclaimthatwehavetechniquesthatworkwellinsomesituations.Thereasonisthat,inmanycases,whatconstitutesagoodsetofclustersisopentosubjectiveinterpretation.Furthermore,whenanobjectivemeasureisemployedtogiveaprecisedefinitionofacluster,theproblemoffindingtheoptimalclusteringisoftencomputationallyinfeasible.

Thischapterfocusesonimportantissuesinclusteranalysisandexplorestheconceptsandapproachesthathavebeendevelopedtoaddressthem.Webeginwithadiscussionofthekeyissuesofclusteranalysis,namely,thecharacteristicsofdata,clusters,andalgorithmsthatstronglyimpactclustering.Theseissues

are important for understanding, describing, and comparing clustering techniques, and provide the basis for deciding which technique to use in a specific situation. For example, many clustering algorithms have a time or space complexity of O(m^2) (m being the number of objects) and, thus, are not suitable for large data sets. We then discuss additional clustering techniques. For each technique, we describe the algorithm, including the issues it addresses and the methods that it uses to address them. We conclude this chapter by providing some general guidelines for selecting a clustering algorithm for a given application.


8.1CharacteristicsofData,Clusters,andClusteringAlgorithmsThissectionexploresissuesrelatedtothecharacteristicsofdata,clusters,andalgorithmsthatareimportantforabroadunderstandingofclusteranalysis.Someoftheseissuesrepresentchallenges,suchashandlingnoiseandoutliers.Otherissuesinvolveadesiredfeatureofanalgorithm,suchasanabilitytoproducethesameresultregardlessoftheorderinwhichthedataobjectsareprocessed.Thediscussioninthissection,alongwiththediscussionofdifferenttypesofclusteringsinSection7.1.2 anddifferenttypesofclustersinSection7.1.3 ,identifiesanumberof“dimensions”thatcanbeusedtodescribeandcomparevariousclusteringalgorithmsandtheclusteringresultsthattheyproduce.Toillustratethis,webeginthissectionwithanexamplethatcomparestwoclusteringalgorithmsthatweredescribedinthepreviouschapter,DBSCANandK-means.Thisisfollowedbyamoredetaileddescriptionofthecharacteristicsofdata,clusters,andalgorithmsthatimpactclusteranalysis.

8.1.1Example:ComparingK-meansandDBSCAN

Tosimplifythecomparison,weassumethattherearenotiesindistancesforeitherK-meansorDBSCANandthatDBSCANalwaysassignsaborderpointthatisassociatedwithseveralcorepointstotheclosestcorepoint.

Both DBSCAN and K-means are partitional clustering algorithms that assign each object to a single cluster, but K-means typically clusters all the objects, while DBSCAN discards objects that it classifies as noise. K-means uses a prototype-based notion of a cluster; DBSCAN uses a density-based concept. DBSCAN can handle clusters of different sizes and shapes and is not strongly affected by noise or outliers. K-means has difficulty with non-globular clusters and clusters of different sizes. Both algorithms can perform poorly when clusters have widely differing densities. K-means can only be used for data that has a well-defined centroid, such as a mean or median. DBSCAN requires that its definition of density, which is based on the traditional Euclidean notion of density, be meaningful for the data. K-means can be applied to sparse, high-dimensional data, such as document data. DBSCAN typically performs poorly for such data because the traditional Euclidean definition of density does not work well for high-dimensional data. The original versions of K-means and DBSCAN were designed for Euclidean data, but both have been extended to handle other types of data. DBSCAN makes no assumption about the distribution of the data. The basic K-means algorithm is equivalent to a statistical clustering approach (mixture models) that assumes all clusters come from spherical Gaussian distributions with different means but the same covariance matrix. See Section 8.2.2. DBSCAN and K-means both look for clusters using all attributes, that is, they do not look for clusters that involve only a subset of the attributes. K-means can find clusters that are not well separated, even if they overlap (see Figure 7.2(b)), but DBSCAN merges clusters that overlap. The K-means algorithm has a time complexity of O(m), while DBSCAN takes O(m^2) time, except for special cases such as low-dimensional Euclidean data. DBSCAN produces the same set of clusters from one run to another, while K-means, which is typically used with random initialization of centroids, does not. DBSCAN automatically determines the number of clusters; for K-means, the number of clusters needs to be specified as a parameter. However, DBSCAN has two other parameters that must be specified, Eps and MinPts. K-means clustering can be viewed as an optimization problem, i.e., minimize the sum of the squared error of each point to its closest centroid, and as a specific case of a statistical clustering approach (mixture models). DBSCAN is not based on any formal model.

8.1.2DataCharacteristics

Thefollowingaresomecharacteristicsofdatathatcanstronglyaffectclusteranalysis.

High Dimensionality In high-dimensional data sets, the traditional Euclidean notion of density, which is the number of points per unit volume, becomes meaningless. To see this, consider that as the number of dimensions increases, the volume increases rapidly, and unless the number of points grows exponentially with the number of dimensions, the density tends to 0. (Volume is exponential in the number of dimensions. For instance, a hypersphere with radius, r, and dimension, d, has volume proportional to r^d.) Also, proximity tends to become more uniform in high-dimensional spaces. Another way to view this fact is that there are more dimensions (attributes) that contribute to the proximity between two points and this tends to make the proximity more uniform. Since most clustering techniques are based on proximity or density, they can often have difficulty with high-dimensional data. One way to address such problems is to employ dimensionality reduction techniques. Another approach, as discussed in Sections 8.4.6 and 8.4.8, is to redefine the notions of proximity and density.
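To make the claim about proximity concrete, the following small NumPy sketch (our illustration, not part of the text) draws a fixed number of points from the unit hypercube and shows that the relative spread of pairwise Euclidean distances shrinks as the dimensionality grows, i.e., proximity becomes increasingly uniform:

import numpy as np

rng = np.random.default_rng(0)
m = 1000  # number of points, held fixed while the dimensionality grows
for d in (2, 10, 50, 200):
    X = rng.random((m, d))                        # m points in the unit hypercube [0, 1]^d
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)   # squared Euclidean distances
    dists = np.sqrt(d2)[~np.eye(m, dtype=bool)]   # drop the zero self-distances
    # as d grows, the distances concentrate around their mean, so "near" and "far" lose meaning
    print(f"d={d:4d}  mean distance={dists.mean():6.2f}  std/mean={dists.std() / dists.mean():.3f}")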

SizeManyclusteringalgorithmsthatworkwellforsmallormedium-sizedatasetsareunabletohandlelargerdatasets.Thisisaddressedfurtherinthediscussionofthecharacteristicsofclusteringalgorithms—scalabilityisonesuchcharacteristic—andinSection8.5 ,whichdiscussesscalableclusteringalgorithms.

SparsenessSparsedataoftenconsistsofasymmetricattributes,wherezerovaluesarenotasimportantasnon-zerovalues.Therefore,similaritymeasuresappropriateforasymmetricattributesarecommonlyused.However,other,relatedissuesalsoarise.Forexample,arethemagnitudesofnon-zeroentriesimportant,ordotheydistorttheclustering?Inotherwords,doestheclusteringworkbestwhenthereareonlytwovalues,0and1?

NoiseandOutliersAnatypicalpoint(outlier)canoftenseverelydegradetheperformanceofclusteringalgorithms,especiallyalgorithmssuchasK-meansthatareprototype-based.Ontheotherhand,noisecancausetechniques,suchassinglelink,tojoinclustersthatshouldnotbejoined.Insomecases,algorithmsforremovingnoiseandoutliersareappliedbeforeaclusteringalgorithmisused.Alternatively,somealgorithmscandetectpointsthatrepresentnoiseandoutliersduringtheclusteringprocessandthendeletethemorotherwiseeliminatetheirnegativeeffects.Inthepreviouschapter,forinstance,wesawthatDBSCANautomaticallyclassifieslow-densitypointsasnoiseandremovesthemfromtheclusteringprocess.Chameleon(Section8.4.4 ),SNNdensity-basedclustering(Section8.4.9 ),andCURE(Section8.5.3 )arethreeofthealgorithmsinthischapterthatexplicitlydealwithnoiseandoutliersduringtheclusteringprocess.

TypeofAttributesandDataSetAsdiscussedinChapter2 ,datasetscanbeofvarioustypes,suchasstructured,graph,orordered,whileattributesareusuallycategorical(nominalorordinal)orquantitative(intervalorratio),andarebinary,discrete,orcontinuous.Differentproximityanddensitymeasuresareappropriatefordifferenttypesofdata.Insomesituations,dataneedstobediscretizedorbinarizedsothatadesiredproximitymeasureorclusteringalgorithmcanbeused.Anothercomplicationoccurswhenattributesareofwidelydifferingtypes,e.g.,continuousandnominal.Insuchcases,proximityanddensityaremoredifficulttodefineandoftenmoreadhoc.Finally,specialdatastructuresandalgorithmsareoftenneededtohandlecertaintypesofdataefficiently.

ScaleDifferentattributes,e.g.,heightandweight,areoftenmeasuredondifferentscales.Thesedifferencescanstronglyaffectthedistanceorsimilaritybetweentwoobjectsand,consequently,theresultsofaclusteranalysis.Considerclusteringagroupofpeoplebasedontheirheights,whicharemeasuredinmeters,andtheirweights,whicharemeasuredinkilograms.IfweuseEuclideandistanceasourproximitymeasure,thenheightwillhavelittleimpactandpeoplewillbeclusteredmostlybasedontheweightattribute.If,however,westandardizeeachattributebysubtractingoffitsmeananddividingbyitsstandarddeviation,thenwewillhaveeliminatedeffectsduetothedifferenceinscale.Moregenerally,normalizationtechniques,suchasthosediscussedinSection2.3.7 ,aretypicallyusedtohandletheseissues.
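A minimal sketch of this standardization step, using made-up height and weight values purely for illustration:

import numpy as np

# Illustrative (made-up) data: column 0 is height in meters, column 1 is weight in kilograms.
people = np.array([[1.60, 60.0],
                   [1.85, 90.0],
                   [1.62, 88.0]])

def euclidean(a, b):
    return np.sqrt(((a - b) ** 2).sum())

# Raw distances are dominated by weight, whose scale is roughly 50 times that of height.
print(euclidean(people[0], people[1]), euclidean(people[0], people[2]))

# Standardize each attribute: subtract its mean and divide by its standard deviation.
z = (people - people.mean(axis=0)) / people.std(axis=0)
print(euclidean(z[0], z[1]), euclidean(z[0], z[2]))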

MathematicalPropertiesoftheDataSpaceSomeclusteringtechniquescalculatethemeanofacollectionofpointsoruseothermathematicaloperationsthatonlymakesenseinEuclideanspaceorinotherspecificdataspaces.Otheralgorithmsrequirethatthedefinitionofdensitybemeaningfulforthedata.

8.1.3ClusterCharacteristics

Thedifferenttypesofclusters,suchasprototype-,graph-,anddensity-based,weredescribedearlierinSection7.1.3 .Here,wedescribeotherimportantcharacteristicsofclusters.

DataDistributionSomeclusteringtechniquesassumeaparticulartypeofdistributionforthedata.Morespecifically,theyoftenassumethatdatacanbemodeledasarisingfromamixtureofdistributions,whereeachclustercorrespondstoadistribution.ClusteringbasedonmixturemodelsisdiscussedinSection8.2.2 .

ShapeSomeclustersareregularlyshaped,e.g.,rectangularorglobular,butingeneral,clusterscanbeofarbitraryshape.TechniquessuchasDBSCANandsinglelinkcanhandleclustersofarbitraryshape,butprototype-basedschemesandsomehierarchicaltechniques,suchascompletelinkandgroupaverage,cannot.Chameleon(Section8.4.4 )andCURE(Section8.5.3 )areexamplesoftechniquesthatwerespecificallydesignedtoaddressthisproblem.

DifferingSizesManyclusteringmethods,suchasK-means,don’tworkwellwhenclustershavedifferentsizes.(SeeSection7.2.4 .)ThistopicisdiscussedfurtherinSection8.6 .

DifferingDensitiesClustersthathavewidelyvaryingdensitycancauseproblemsformethodssuchasDBSCANandK-means.TheSNNdensity-basedclusteringtechniquepresentedinSection8.4.9 addressesthisissue.

PoorlySeparatedClustersWhenclusterstouchoroverlap,someclusteringtechniquescombineclustersthatshouldbekeptseparate.Eventechniquesthatfinddistinctclustersarbitrarilyassignpointstooneclusteroranother.Fuzzyclustering,whichisdescribedinSection8.2.1 ,isonetechniquefordealingwithdatathatdoesnotformwell-separatedclusters.

RelationshipsamongClustersInmostclusteringtechniques,thereisnoexplicitconsiderationoftherelationshipsbetweenclusters,suchastheirrelativeposition.Self-organizingmaps(SOM),whicharedescribedinSection8.2.3 ,areaclusteringtechniquethatdirectlyconsiderstherelationshipsbetweenclustersduringtheclusteringprocess.Specifically,theassignmentofapointtooneclusteraffectsthedefinitionsofnearbyclusters.

SubspaceClustersClustersmayonlyexistinasubsetofdimensions(attributes),andtheclustersdeterminedusingonesetofdimensionsarefrequentlyquitedifferentfromtheclustersdeterminedbyusinganotherset.Whilethisissuecanarisewithasfewastwodimensions,itbecomesmoreacuteasdimensionalityincreases,becausethenumberofpossiblesubsetsofdimensionsisexponentialinthetotalnumberofdimensions.Forthatreason,itisnotfeasibletosimplylookforclustersinallpossiblesubsetsofdimensionsunlessthenumberofdimensionsisrelativelylow.

Oneapproachistoapplyfeatureselection,whichwasdiscussedinSection2.3.4 .However,thisapproachassumesthatthereisonlyonesubsetofdimensionsinwhichtheclustersexist.Inreality,clusterscanexistinmanydistinctsubspaces(setsofdimensions),someofwhichoverlap.Section8.3.2 considerstechniquesthataddressthegeneralproblemofsubspaceclustering,i.e.,offindingbothclustersandthedimensionstheyspan.

8.1.4GeneralCharacteristicsofClusteringAlgorithms

Clusteringalgorithmsarequitevaried.Weprovideageneraldiscussionofimportantcharacteristicsofclusteringalgorithmshere,andmakemorespecificcommentsduringourdiscussionofparticulartechniques.

OrderDependenceForsomealgorithms,thequalityandnumberofclustersproducedcanvary,perhapsdramatically,dependingontheorderinwhichthedataisprocessed.Whileitwouldseemdesirabletoavoidsuchalgorithms,sometimestheorderdependenceisrelativelyminororthealgorithmhasotherdesirablecharacteristics.SOM(Section8.2.3 )isanexampleofanalgorithmthatisorderdependent.

NondeterminismClusteringalgorithms,suchasK-means,arenotorder-dependent,buttheyproducedifferentresultsforeachrunbecausetheyrelyonaninitializationstepthatrequiresarandomchoice.Becausethequalityoftheclusterscanvaryfromoneruntoanother,multiplerunscanbenecessary.

Scalability It is not unusual for a data set to contain millions of objects, and the clustering algorithms used for such data sets should have linear or near-linear time and space complexity. Even algorithms that have a complexity of O(m²) are not practical for large data sets. Furthermore, clustering techniques for data sets cannot always assume that all the data will fit in main memory or that data elements can be randomly accessed. Such algorithms are infeasible for large data sets. Section 8.5 is devoted to the issue of scalability.

Parameter Selection Most clustering algorithms have one or more parameters that need to be set by the user. It can be difficult to choose the proper values; thus, the attitude is usually, "the fewer parameters, the better." Choosing parameter values becomes even more challenging if a small change in the parameters drastically changes the clustering results. Finally, unless a procedure (which might involve user input) is provided for determining parameter values, a user of the algorithm is reduced to using trial and error to find suitable parameter values.

Perhapsthemostwell-knownparameterselectionproblemisthatof“choosingtherightnumberofclusters”forpartitionalclusteringalgorithms,suchasK-means.OnepossibleapproachtothatissueisgiveninSection7.5.5 ,whilereferencestoothersareprovidedintheBibliographicNotes.

TransformingtheClusteringProblemtoAnotherDomainOneapproachtakenbysomeclusteringtechniquesistomaptheclusteringproblemtoaprobleminadifferentdomain.Graph-basedclustering,forinstance,mapsthetaskoffindingclusterstothetaskofpartitioningaproximitygraphintoconnectedcomponents.

TreatingClusteringasanOptimizationProblemClusteringisoftenviewedasanoptimizationproblem:dividethepointsintoclustersinawaythatmaximizesthegoodnessoftheresultingsetofclustersasmeasuredbyauser-specifiedobjectivefunction.Forexample,theK-meansclusteringalgorithm(Section7.2 )triestofindthesetofclustersthatminimizesthesumofthesquareddistanceofeachpointfromitsclosestclustercentroid.Intheory,suchproblemscanbesolvedbyenumeratingallpossiblesetsofclustersandselectingtheonewiththebestvalueoftheobjectivefunction,butthisexhaustiveapproachiscomputationallyinfeasible.Forthisreason,manyclusteringtechniquesarebasedonheuristicapproachesthatproducegood,butnotoptimalclusterings.Anotherapproachistouseobjectivefunctionsonagreedyorlocalbasis.Inparticular,thehierarchicalclusteringtechniques

discussedinSection7.3 proceedbymakinglocallyoptimal(greedy)decisionsateachstepoftheclusteringprocess.

RoadMapWearrangeourdiscussionofclusteringalgorithmsinamannersimilartothatofthepreviouschapter,groupingtechniquesprimarilyaccordingtowhethertheyareprototype-based,density-based,orgraph-based.Thereis,however,aseparatediscussionforscalableclusteringtechniques.Weconcludethischapterwithadiscussionofhowtochooseaclusteringalgorithm.

8.2Prototype-BasedClusteringInprototype-basedclustering,aclusterisasetofobjectsinwhichanyobjectisclosertotheprototypethatdefinestheclusterthantotheprototypeofanyothercluster.Section7.2 describedK-means,asimpleprototype-basedclusteringalgorithmthatusesthecentroidoftheobjectsinaclusterastheprototypeofthecluster.Thissectiondiscussesclusteringapproachesthatexpandontheconceptofprototype-basedclusteringinoneormoreways,asdiscussednext:

Objectsareallowedtobelongtomorethanonecluster.Morespecifically,anobjectbelongstoeveryclusterwithsomeweight.Suchanapproachaddressesthefactthatsomeobjectsareequallyclosetoseveralclusterprototypes.Aclusterismodeledasastatisticaldistribution,i.e.,objectsaregeneratedbyarandomprocessfromastatisticaldistributionthatischaracterizedbyanumberofstatisticalparameters,suchasthemeanandvariance.Thisviewpointgeneralizesthenotionofaprototypeandenablestheuseofwell-establishedstatisticaltechniques.Clustersareconstrainedtohavefixedrelationships.Mostcommonly,theserelationshipsareconstraintsthatspecifyneighborhoodrelationships;i.e.,thedegreetowhichtwoclustersareneighborsofeachother.Constrainingtherelationshipsamongclusterscansimplifytheinterpretationandvisualizationofthedata.

Weconsiderthreespecificclusteringalgorithmstoillustratetheseextensionsofprototype-basedclustering.Fuzzyc-meansusesconceptsfromthefieldoffuzzylogicandfuzzysettheorytoproposeaclusteringscheme,whichismuchlikeK-means,butwhichdoesnotrequireahardassignmentofapoint

toonlyonecluster.Mixturemodelclusteringtakestheapproachthatasetofclusterscanbemodeledasamixtureofdistributions,oneforeachcluster.TheclusteringschemebasedonSelf-OrganizingMaps(SOM)performsclusteringwithinaframeworkthatrequiresclusterstohaveaprespecifiedrelationshiptooneanother,e.g.,atwo-dimensionalgridstructure.

8.2.1FuzzyClustering

If data objects are distributed in well-separated groups, then a crisp classification of the objects into disjoint clusters seems like an ideal approach. However, in most cases, the objects in a data set cannot be partitioned into well-separated clusters, and there will be a certain arbitrariness in assigning an object to a particular cluster. Consider an object that lies near the boundary of two clusters, but is slightly closer to one of them. In many such cases, it might be more appropriate to assign a weight to each object and each cluster that indicates the degree to which the object belongs to the cluster. Mathematically, w_ij is the weight with which object x_i belongs to cluster C_j.

As shown in the next section, probabilistic approaches can also provide such weights. While probabilistic approaches are useful in many situations, there are times when it is difficult to determine an appropriate statistical model. In such cases, non-probabilistic clustering techniques are needed to provide similar capabilities. Fuzzy clustering techniques are based on fuzzy set theory and provide a natural technique for producing a clustering in which membership weights (the w_ij) have a natural (but not probabilistic) interpretation. This section describes the general approach of fuzzy clustering and provides a specific example in terms of fuzzy c-means (fuzzy K-means).

Fuzzy Sets

LotfiZadehintroducedfuzzysettheoryandfuzzylogicin1965asawayofdealingwithimprecisionanduncertainty.Briefly,fuzzysettheoryallowsanobjecttobelongtoasetwithadegreeofmembershipbetween0and1,whilefuzzylogicallowsastatementtobetruewithadegreeofcertaintybetween0and1.Traditionalsettheoryandlogicarespecialcasesoftheirfuzzycounterpartsthatrestrictthedegreeofsetmembershiporthedegreeofcertaintytobeeither0or1.Fuzzyconceptshavebeenappliedtomanydifferentareas,includingcontrolsystems,patternrecognition,anddataanalysis(classificationandclustering).

Considerthefollowingexampleoffuzzylogic.Thedegreeoftruthofthestatement“Itiscloudy”canbedefinedtobethepercentageofcloudcoverinthesky,e.g.,iftheskyis50%coveredbyclouds,thenwewouldassign“Itiscloudy”adegreeoftruthof0.5.Ifwehavetwosets,“cloudydays”and“non-cloudydays,”thenwecansimilarlyassigneachdayadegreeofmembershipinthetwosets.Thus,ifadaywere25%cloudy,itwouldhavea25%degreeofmembershipin“cloudydays”anda75%degreeofmembershipin“non-cloudydays.”

Fuzzy Clusters Assume that we have a set of data points X = {x_1, …, x_m}, where each point, x_i, is an n-dimensional point, i.e., x_i = (x_{i1}, …, x_{in}). A collection of fuzzy clusters, C_1, C_2, …, C_k, is a subset of all possible fuzzy subsets of X. (This simply means that the membership weights (degrees), w_ij, have been assigned values between 0 and 1 for each point, x_i, and each cluster, C_j.) However, we also want to impose the following reasonable conditions on the clusters in order to ensure that the clusters form what is called a fuzzy pseudo-partition.

1. All the weights for a given point, x_i, add up to 1:

\sum_{j=1}^{k} w_{ij} = 1

2. Each cluster, C_j, contains, with non-zero weight, at least one point, but does not contain, with a weight of one, all of the points:

0 < \sum_{i=1}^{m} w_{ij} < m

Fuzzyc-meansWhiletherearemanytypesoffuzzyclustering—indeed,manydataanalysisalgorithmscanbe“fuzzified”—weonlyconsiderthefuzzyversionofK-means,whichiscalledfuzzyc-means.Intheclusteringliterature,theversionofK-meansthatdoesnotuseincrementalupdatesofclustercentroidsissometimesreferredtoasc-means,andthiswasthetermadaptedbythefuzzycommunityforthefuzzyversionofK-means.Thefuzzyc-meansalgorithm,alsosometimesknownasFCM,isgivenbyAlgorithm8.1 .

Algorithm 8.1 Basic fuzzy c-means algorithm.

1: Select an initial fuzzy pseudo-partition, i.e., assign values to all the w_ij.
2: repeat
3:   Compute the centroid of each cluster using the fuzzy pseudo-partition.
4:   Recompute the fuzzy pseudo-partition, i.e., the w_ij.
5: until The centroids don't change. (Alternative stopping conditions are "if the change in the error is below a specified threshold" or "if the absolute change in any w_ij is below a given threshold.")

Afterinitialization,FCMrepeatedlycomputesthecentroidsofeachclusterandthefuzzypseudo-partitionuntilthepartitiondoesnotchange.FCMissimilarinstructuretotheK-meansalgorithm,whichafterinitialization,alternatesbetweenastepthatupdatesthecentroidsandastepthatassignseachobjecttotheclosestcentroid.Specifically,computingafuzzypseudo-partitionisequivalenttotheassignmentstep.AswithK-means,FCMcanbeinterpretedasattemptingtominimizethesumofthesquarederror(SSE),althoughFCMisbasedonafuzzyversionofSSE.Indeed,K-meanscanberegardedasaspecialcaseofFCMandthebehaviorofthetwoalgorithmsisquitesimilar.ThedetailsofFCMaredescribedbelow.

Computing SSE

The definition of the sum of the squared error (SSE) is modified as follows:

SSE(C_1, C_2, \ldots, C_k) = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^{p} \, dist(x_i, c_j)^2    (8.1)

where c_j is the centroid of the jth cluster and p, which is the exponent that determines the influence of the weights, has a value between 1 and ∞. Note that this SSE is just a weighted version of the traditional K-means SSE given in Equation 7.1.

Initialization

Random initialization is often used. In particular, weights are chosen randomly, subject to the constraint that the weights associated with any object must sum to 1. As with K-means, random initialization is simple, but often results in a clustering that represents a local minimum in terms of the SSE. Section 7.2.1, which contains a discussion on choosing initial centroids for K-means, has considerable relevance for FCM as well.

Computing Centroids

The definition of the centroid given in Equation 8.2 can be derived by finding the centroid that minimizes the fuzzy SSE as given by Equation 8.1. (See the approach in Section 7.2.6.) For a cluster, C_j, the corresponding centroid, c_j, is defined by the following equation:

c_j = \sum_{i=1}^{m} w_{ij}^{p} x_i \Big/ \sum_{i=1}^{m} w_{ij}^{p}    (8.2)

The fuzzy centroid definition is similar to the traditional definition except that all points are considered (any point can belong to any cluster, at least somewhat) and the contribution of each point to the centroid is weighted by its membership degree. In the case of traditional crisp sets, where all w_ij are either 0 or 1, this definition reduces to the traditional definition of a centroid.

There are a few considerations when choosing the value of p. Choosing p = 2 simplifies the weight update formula—see Equation 8.4. However, if p is chosen to be near 1, then fuzzy c-means behaves like traditional K-means. Going in the other direction, as p gets larger, all the cluster centroids approach the global centroid of all the data points. In other words, the partition becomes fuzzier as p increases.

Updating the Fuzzy Pseudo-partition

Because the fuzzy pseudo-partition is defined by the weights, this step involves updating the weights w_ij associated with the ith point and jth cluster. The weight update formula given in Equation 8.3 can be derived by minimizing the SSE of Equation 8.1 subject to the constraint that the weights sum to 1.

w_{ij} = \left(1/dist(x_i, c_j)^2\right)^{\frac{1}{p-1}} \Big/ \sum_{q=1}^{k} \left(1/dist(x_i, c_q)^2\right)^{\frac{1}{p-1}}    (8.3)

This formula might appear a bit mysterious. However, note that if p = 2, then we obtain Equation 8.4, which is somewhat simpler. We provide an intuitive explanation of Equation 8.4, which, with a slight modification, also applies to Equation 8.3.

w_{ij} = \left(1/dist(x_i, c_j)^2\right) \Big/ \sum_{q=1}^{k} \left(1/dist(x_i, c_q)^2\right)    (8.4)

Intuitively, the weight w_ij, which indicates the degree of membership of point x_i in cluster C_j, should be relatively high if x_i is close to centroid c_j (if dist(x_i, c_j) is low) and relatively low if x_i is far from centroid c_j (if dist(x_i, c_j) is high). If w_ij = 1/dist(x_i, c_j)^2, which is the numerator of Equation 8.4, then this will indeed be the case. However, the membership weights for a point will not sum to one unless they are normalized; i.e., divided by the sum of all the weights as in Equation 8.4. To summarize, the membership weight of a point in a cluster is just the reciprocal of the square of the distance between the point and the cluster centroid divided by the sum of all the membership weights of the point.

Now consider the impact of the exponent 1/(p−1) in Equation 8.3. If p > 2, then this exponent decreases the weight assigned to clusters that are close to the point. Indeed, as p goes to infinity, the exponent tends to 0 and weights tend to the value 1/k. On the other hand, as p approaches 1, the exponent increases the membership weights of points to which the cluster is close. As p goes to 1, the membership weight goes to 1 for the closest cluster and to 0 for all the other clusters. This corresponds to K-means.

Example 8.1 (Fuzzy c-means on Three Circular Clusters). Figure 8.1 shows the result of applying fuzzy c-means to find three clusters for a two-dimensional data set of 100 points. Each point was

assignedtotheclusterinwhichithadthelargestmembershipweight.Thepointsbelongingtoeachclusterareshownbydifferentmarkershapes,whilethedegreeofmembershipintheclusterisshownbytheshading.Thedarkerthepoints,thestrongertheirmembershipintheclustertowhichtheyhavebeenassigned.Themembershipinaclusterisstrongesttowardthecenteroftheclusterandweakestforthosepointsthatarebetweenclusters.

Figure8.1.Fuzzyc-meansclusteringofatwo-dimensionalpointset.

StrengthsandLimitations

ApositivefeatureofFCMisthatitproducesaclusteringthatprovidesanindicationofthedegreetowhichanypointbelongstoanycluster.Otherwise,ithasmuchthesamestrengthsandweaknessesasK-means,althoughitissomewhatmorecomputationallyintensive.
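The following is a minimal NumPy sketch of the fuzzy c-means iteration of Algorithm 8.1 (the function name and initialization details are ours); with p = 2 the weight update is exactly Equation 8.4:

import numpy as np

def fuzzy_cmeans(X, k, p=2, n_iter=100, seed=0):
    # X: (m, n) data matrix; returns centroids and membership weights per Equations 8.2 and 8.3.
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    w = rng.random((m, k))
    w /= w.sum(axis=1, keepdims=True)              # initial fuzzy pseudo-partition: each row sums to 1
    for _ in range(n_iter):
        wp = w ** p
        centroids = (wp.T @ X) / wp.sum(axis=0)[:, None]            # Equation 8.2
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        d2 = np.maximum(d2, 1e-12)                 # guard against a point sitting exactly on a centroid
        inv = (1.0 / d2) ** (1.0 / (p - 1))
        w = inv / inv.sum(axis=1, keepdims=True)                     # Equation 8.3 (8.4 when p = 2)
    return centroids, w

Taking the largest weight in each row of w gives a crisp assignment of points to clusters, which is how the points in Figure 8.1 were assigned.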

8.2.2ClusteringUsingMixtureModels

Thissectionconsidersclusteringbasedonstatisticalmodels.Itisoftenconvenientandeffectivetoassumethatdatahasbeengeneratedasaresultofastatisticalprocessandtodescribethedatabyfindingthestatisticalmodelthatbestfitsthedata,wherethestatisticalmodelisdescribedintermsofadistributionandasetofparametersforthatdistribution.Atahighlevel,thisprocessinvolvesdecidingonastatisticalmodelforthedataandestimatingtheparametersofthatmodelfromthedata.Thissectiondescribesaparticularkindofstatisticalmodel,mixturemodels,whichmodelthedatabyusinganumberofstatisticaldistributions.Eachdistributioncorrespondstoaclusterandtheparametersofeachdistributionprovideadescriptionofthecorrespondingcluster,typicallyintermsofitscenterandspread.

Thediscussioninthissectionproceedsasfollows.Afterprovidingadescriptionofmixturemodels,weconsiderhowparameterscanbeestimatedforstatisticaldatamodels.Wefirstdescribehowaprocedureknownasmaximumlikelihoodestimation(MLE)canbeusedtoestimateparametersforsimplestatisticalmodelsandthendiscusshowwecanextendthisapproachforestimatingtheparametersofmixturemodels.Specifically,wedescribethewell-knownExpectation-Maximization(EM)algorithm,whichmakesaninitialguessfortheparameters,andtheniterativelyimprovestheseestimates.WepresentexamplesofhowtheEMalgorithmcanbeusedto

clusterdatabyestimatingtheparametersofamixturemodelanddiscussitsstrengthsandlimitations.

Afirmunderstandingofstatisticsandprobability,ascoveredinAppendixC,isessentialforunderstandingthissection.Also,forconvenienceinthefollowingdiscussion,weusethetermprobabilitytorefertobothprobabilityandprobabilitydensity.

MixtureModelsMixturemodelsviewthedataasasetofobservationsfromamixtureofdifferentprobabilitydistributions.Theprobabilitydistributionscanbeanything,butareoftentakentobemultivariatenormal,asthistypeofdistributioniswellunderstood,mathematicallyeasytoworkwith,andhasbeenshowntoproducegoodresultsinmanyinstances.Thesetypesofdistributionscanmodelellipsoidalclusters.

Conceptually,mixturemodelscorrespondtothefollowingprocessofgeneratingdata.Givenseveraldistributions,usuallyofthesametype,butwithdifferentparameters,randomlyselectoneofthesedistributionsandgenerateanobjectfromit.Repeattheprocessmtimes,wheremisthenumberofobjects.

More formally, assume that there are K distributions and m objects, X = {x_1, …, x_m}. Let the jth distribution have parameters θ_j, and let Θ be the set of all parameters, i.e., Θ = {θ_1, …, θ_K}. Then, prob(x_i | θ_j) is the probability of the ith object if it comes from the jth distribution. The probability that the jth distribution is chosen to generate an object is given by the weight w_j, 1 ≤ j ≤ K, where these weights (probabilities) are subject to the constraint that they sum to one, i.e., \sum_{j=1}^{K} w_j = 1. Then, the probability of an object x is given by Equation 8.5.

prob(x \mid \Theta) = \sum_{j=1}^{K} w_j\, p_j(x \mid \theta_j)    (8.5)

If the objects are generated in an independent manner, then the probability of the entire set of objects is just the product of the probabilities of each individual x_i.

prob(X \mid \Theta) = \prod_{i=1}^{m} prob(x_i \mid \Theta) = \prod_{i=1}^{m} \sum_{j=1}^{K} w_j\, p_j(x_i \mid \theta_j)    (8.6)

For mixture models, each distribution describes a different group, i.e., a different cluster. By using statistical methods, we can estimate the parameters of these distributions from the data and thus describe these distributions (clusters). We can also identify which objects belong to which clusters. However, mixture modeling does not produce a crisp assignment of objects to clusters, but rather gives the probability with which a specific object belongs to a particular cluster.

Example 8.2 (Univariate Gaussian Mixture). We provide a concrete illustration of a mixture model in terms of Gaussian distributions. The probability density function for a one-dimensional Gaussian distribution at a point x is

prob(x \mid \Theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.    (8.7)

The parameters of the Gaussian distribution are given by θ = (μ, σ), where μ is the mean of the distribution and σ is the standard deviation. Assume that there are two Gaussian distributions, with a common standard deviation of 2 and means of −4 and 4, respectively. Also assume that each of the two distributions is selected with equal probability, i.e., w_1 = w_2 = 0.5. Then Equation 8.5 becomes the following:

prob(x \mid \Theta) = \frac{1}{2}\cdot\frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x+4)^2}{8}} + \frac{1}{2}\cdot\frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x-4)^2}{8}}.    (8.8)

Figure8.2(a) showsaplotoftheprobabilitydensityfunctionofthismixturemodel,whileFigure8.2(b) showsthehistogramfor20,000pointsgeneratedfromthismixturemodel.

Figure8.2.Mixturemodelconsistingoftwonormaldistributionswithmeansof-4and4,respectively.Bothdistributionshaveastandarddeviationof2.
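The following short sketch (ours, under the assumptions of Example 8.2) evaluates the mixture density of Equation 8.8 and draws samples from the mixture, which is how a histogram such as the one in Figure 8.2(b) can be generated:

import numpy as np

def mixture_density(x, means=(-4.0, 4.0), sigma=2.0, weights=(0.5, 0.5)):
    # Equation 8.5 specialized to two univariate Gaussians (Equation 8.8).
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for w, mu in zip(weights, means):
        total += w * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return total

rng = np.random.default_rng(0)
component = rng.integers(0, 2, size=20000)             # pick one of the two distributions with equal probability
samples = rng.normal(np.where(component == 0, -4.0, 4.0), 2.0)
print(mixture_density([-4.0, 0.0, 4.0]))               # the density is highest near the two means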

EstimatingModelParametersUsingMaximumLikelihoodGivenastatisticalmodelforthedata,itisnecessarytoestimatetheparametersofthatmodel.Astandardapproachusedforthistaskismaximumlikelihoodestimation,whichwenowexplain.

Consider a set of m points that are generated from a one-dimensional Gaussian distribution. Assuming that the points are generated independently, the probability of these points is just the product of their individual probabilities. (Again, we are dealing with probability densities, but to keep our terminology simple, we will refer to probabilities.) Using Equation 8.7, we can write this probability as shown in Equation 8.9. Because this probability would be a very small number, we typically will work with the log probability, as shown in Equation 8.10.

prob(X \mid \Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}    (8.9)

\log prob(X \mid \Theta) = -\sum_{i=1}^{m} \frac{(x_i-\mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma    (8.10)

We would like to find a procedure to estimate μ and σ if they are unknown. One approach is to choose the values of the parameters for which the data is most probable (most likely). In other words, choose the μ and σ that maximize Equation 8.9. This approach is known in statistics as the maximum likelihood principle, and the process of applying this principle to estimate the parameters of a statistical distribution from the data is known as maximum likelihood estimation (MLE).

The principle is called the maximum likelihood principle because, given a set of data, the probability of the data, regarded as a function of the parameters, is called a likelihood function. To illustrate, we rewrite Equation 8.9 as Equation 8.11 to emphasize that we view the statistical parameters μ and σ as our variables and that the data is regarded as a constant. For practical reasons, the log likelihood is more commonly used. The log likelihood function derived from the log probability of Equation 8.10 is shown in Equation 8.12. Note that the parameter values that maximize the log likelihood also maximize the likelihood since log is a monotonically increasing function.

likelihood(\Theta \mid X) = L(\Theta \mid X) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}    (8.11)

\log likelihood(\Theta \mid X) = \ell(\Theta \mid X) = -\sum_{i=1}^{m} \frac{(x_i-\mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma    (8.12)

Example 8.3 (Maximum Likelihood Parameter Estimation). We provide a concrete illustration of the use of MLE for finding parameter values. Suppose that we have the set of 200 points whose histogram is shown in Figure 8.3(a). Figure 8.3(b) shows the maximum log likelihood plot for the 200 points under consideration. The values of the parameters for which the log probability is a maximum are μ = −4.1 and σ = 2.1, which are close to the parameter values of the underlying Gaussian distribution, μ = −4.0 and σ = 2.0.

Figure 8.3. 200 points from a Gaussian distribution and their log probability for different parameter values.

Graphing the likelihood of the data for different values of the parameters is not practical, at least if there are more than two parameters. Thus, the standard statistical procedure is to derive the maximum likelihood estimates of a statistical parameter by taking the derivative of the likelihood function with respect to that parameter, setting the result equal to 0, and solving. In particular, for a Gaussian distribution, it can be shown that the mean and standard deviation of the sample points are the maximum likelihood estimates of the corresponding parameters of the underlying distribution. (See Exercise 9 on page 700.) Indeed, for the 200 points considered in our example, the parameter values that maximized the log likelihood were precisely the mean and standard deviation of the 200 points, i.e., μ = −4.1 and σ = 2.1.
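The following small sketch (ours, using synthetic data rather than the 200 points above) checks this claim numerically: the log likelihood of Equation 8.12, evaluated over a grid of candidate parameters, is maximized essentially at the sample mean and standard deviation:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(-4.0, 2.0, size=200)      # synthetic data; the book's 200 points are not reproduced here
m = len(x)

def log_likelihood(mu, sigma):
    # Equation 8.12
    return -np.sum((x - mu) ** 2) / (2 * sigma ** 2) - 0.5 * m * np.log(2 * np.pi) - m * np.log(sigma)

mus = np.linspace(-6, -2, 81)
sigmas = np.linspace(1, 3, 81)
grid = np.array([[log_likelihood(mu, s) for s in sigmas] for mu in mus])
i, j = np.unravel_index(grid.argmax(), grid.shape)
print("grid maximum:", mus[i], sigmas[j])
print("sample mean and std:", x.mean(), x.std())   # essentially the same values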

EstimatingMixtureModelParametersUsingMaximumLikelihood:TheEMAlgorithmWecanalsousethemaximumlikelihoodapproachtoestimatethemodelparametersforamixturemodel.Inthesimplestcase,weknowwhichdataobjectscomefromwhichdistributions,andthesituationreducestooneofestimatingtheparametersofasingledistributiongivendatafromthatdistribution.Formostcommondistributions,themaximumlikelihoodestimatesoftheparametersarecalculatedfromsimpleformulasinvolvingthedata.

In a more general (and more realistic) situation, we do not know which points were generated by which distribution. Thus, we cannot directly calculate the probability of each data point, and hence, it would seem that we cannot use the maximum likelihood principle to estimate parameters. The solution to this problem is the EM algorithm, which is shown in Algorithm 8.2. Briefly, given a guess for the parameter values, the EM algorithm calculates the probability that each point belongs to each distribution and then uses these probabilities to compute a new estimate for the parameters. (These parameters are the ones that maximize the likelihood.) This iteration continues until the estimates of the parameters either do not change or change very little. Thus, we still employ maximum likelihood estimation, but via an iterative search.

Algorithm 8.2 EM algorithm.

1: Select an initial set of model parameters. (As with K-means, this can be done randomly or in a variety of ways.)
2: repeat
3:   Expectation Step For each object, calculate the probability that each object belongs to each distribution, i.e., calculate prob(distribution j | x_i, Θ).
4:   Maximization Step Given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood.
5: until The parameters do not change. (Alternatively, stop if the change in the parameters is below a specified threshold.)

The EM algorithm is similar to the K-means algorithm given in Section 7.2.1. Indeed, the K-means algorithm for Euclidean data is a special case of the EM algorithm for spherical Gaussian distributions with equal covariance matrices, but different means. The expectation step corresponds to the K-means step of assigning each object to a cluster. Instead, each object is assigned to every cluster (distribution) with some probability. The maximization step corresponds to computing the cluster centroids. Instead, all the parameters of the distributions, as well as the weight parameters, are selected to maximize the likelihood. This process is often straightforward, as the parameters are typically computed using formulas derived from maximum likelihood estimation. For instance, for a single Gaussian distribution, the MLE estimate of the mean is the mean of the objects in the distribution. In the context of mixture models and the EM algorithm, the computation of the mean is modified to account for the fact that every object belongs to a distribution with a certain probability. This is illustrated further in the following example.

Example8.4(SimpleExampleofEMAlgorithm).ThisexampleillustrateshowEMoperateswhenappliedtothedatainFigure8.2 .Tokeeptheexampleassimpleaspossible,weassumethatweknowthatthestandarddeviationofbothdistributionsis2.0andthatpointsweregeneratedwithequalprobabilityfrombothdistributions.Wewillrefertotheleftandrightdistributionsasdistributions1and2,respectively.

We begin the EM algorithm by making initial guesses for μ_1 and μ_2, say, μ_1 = −2 and μ_2 = 3. Thus, the initial parameters, θ = (μ, σ), for the two distributions are, respectively, θ_1 = (−2, 2) and θ_2 = (3, 2). The set of parameters for the entire mixture model is Θ = {θ_1, θ_2}. For the expectation step of EM, we want to compute the probability that a point came from a particular distribution; i.e., we want to compute prob(distribution 1 | x_i, Θ) and prob(distribution 2 | x_i, Θ). These values can be expressed by Equation 8.13, which is a straightforward application of Bayes rule, which is described in Appendix C.

prob(\mathrm{distribution}\ j \mid x_i, \Theta) = \frac{0.5\, prob(x_i \mid \theta_j)}{0.5\, prob(x_i \mid \theta_1) + 0.5\, prob(x_i \mid \theta_2)},    (8.13)

where 0.5 is the probability (weight) of each distribution and j is 1 or 2.

For instance, assume one of the points is 0. Using the Gaussian density function given in Equation 8.7, we compute that prob(0 | θ_1) = 0.12 and prob(0 | θ_2) = 0.06. (Again, we are really computing probability densities.) Using these values and Equation 8.13, we find that prob(distribution 1 | 0, Θ) = 0.12/(0.12 + 0.06) = 0.66 and prob(distribution 2 | 0, Θ) = 0.06/(0.12 + 0.06) = 0.33. This means that the point 0 is twice as likely to belong to distribution 1 as distribution 2 based on the current assumptions for the parameter values.

After computing the cluster membership probabilities for all 20,000 points, we compute new estimates for μ_1 and μ_2 (using Equations 8.14 and 8.15) in the maximization step of the EM algorithm. Notice that the new estimate for the mean of a distribution is just a weighted average of the points, where the weights are the probabilities that the points belong to the distribution, i.e., the prob(distribution j | x_i) values.

\mu_1 = \sum_{i=1}^{20{,}000} x_i\, prob(\mathrm{distribution}\ 1 \mid x_i, \Theta) \Big/ \sum_{i=1}^{20{,}000} prob(\mathrm{distribution}\ 1 \mid x_i, \Theta)    (8.14)

\mu_2 = \sum_{i=1}^{20{,}000} x_i\, prob(\mathrm{distribution}\ 2 \mid x_i, \Theta) \Big/ \sum_{i=1}^{20{,}000} prob(\mathrm{distribution}\ 2 \mid x_i, \Theta)    (8.15)

We repeat these two steps until the estimates of μ_1 and μ_2 either don't change or change very little. Table 8.1 gives the first few iterations of the EM algorithm when it is applied to the set of 20,000 points. For this data, we know which distribution generated which point, so we can also compute the mean of the points from each distribution. The means are μ_1 = −3.98 and μ_2 = 4.03.

Table 8.1. First few iterations of the EM algorithm for the simple example.

Iteration   μ_1     μ_2
0           −2.00   3.00
1           −3.74   4.10
2           −3.94   4.07
3           −3.97   4.04
4           −3.98   4.03
5           −3.98   4.03
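A compact sketch (ours) of the expectation and maximization steps for this example, with the standard deviations fixed at 2 and the weights at 0.5 as assumed above; Equations 8.13 through 8.15 appear directly in the code:

import numpy as np

rng = np.random.default_rng(0)
# Generate 20,000 points from the mixture of Figure 8.2 (means -4 and 4, sigma 2, equal weights).
comp = rng.integers(0, 2, size=20000)
x = rng.normal(np.where(comp == 0, -4.0, 4.0), 2.0)

def gauss(x, mu, sigma=2.0):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

mu1, mu2 = -2.0, 3.0                    # initial guesses, as in Example 8.4
for it in range(6):
    # Expectation step: Equation 8.13
    p1 = 0.5 * gauss(x, mu1)
    p2 = 0.5 * gauss(x, mu2)
    r1 = p1 / (p1 + p2)
    r2 = 1.0 - r1
    # Maximization step: Equations 8.14 and 8.15
    mu1 = np.sum(x * r1) / np.sum(r1)
    mu2 = np.sum(x * r2) / np.sum(r2)
    print(it + 1, round(mu1, 2), round(mu2, 2))

The printed estimates behave like the iterations reported in Table 8.1, although the exact numbers depend on the random sample.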

Example8.5(TheEMAlgorithmonSampleDataSets).WegivethreeexamplesthatillustratetheuseoftheEMalgorithmtofindclustersusingmixturemodels.Thefirstexampleisbasedonthedatasetusedtoillustratethefuzzyc-meansalgorithm—seeFigure8.1 .Wemodeledthisdataasamixtureofthreetwo-dimensionalGaussiandistributionswithdifferentmeansandidenticalcovariancematrices.WethenclusteredthedatausingtheEMalgorithm.TheresultsareshowninFigure8.4 .Eachpointwasassignedtotheclusterinwhichithadthelargestmembershipweight.Thepointsbelongingtoeachclusterareshownbydifferentmarkershapes,whilethedegreeofmembershipintheclusterisshownbytheshading.Membershipinaclusterisrelativelyweakforthosepointsthatareontheborderofthetwoclusters,butstrongelsewhere.ItisinterestingtocomparethemembershipweightsandprobabilitiesofFigures8.4 and8.1 .(SeeExercise11 onpage700.)


Figure8.4.EMclusteringofatwo-dimensionalpointsetwiththreeclusters.

For our second example, we apply mixture model clustering to data that contains clusters with different densities. The data consists of two natural clusters, each with roughly 500 points. This data was created by combining two sets of Gaussian data, one with a center at (−4, 1) and a standard deviation of 2, and one with a center at (0, 0) and a standard deviation of 0.5. Figure 8.5 shows the clustering produced by the EM algorithm. Despite the differences in the density, the EM algorithm is quite successful at identifying the original clusters.

Figure8.5.EMclusteringofatwo-dimensionalpointsetwithtwoclustersofdifferingdensity.

Forourthirdexample,weusemixturemodelclusteringonadatasetthatK-meanscannotproperlyhandle.Figure8.6(a) showstheclusteringproducedbyamixturemodelalgorithm,whileFigure8.6(b) showstheK-meansclusteringofthesamesetof1,000points.Formixturemodelclustering,eachpointhasbeenassignedtotheclusterforwhichithasthehighestprobability.Inbothfigures,differentmarkersareusedtodistinguishdifferentclusters.Donotconfusethe‘+’and‘x’markersinFigure8.6(a) .

Figure8.6.MixturemodelandK-meansclusteringofasetoftwo-dimensionalpoints.

AdvantagesandLimitationsofMixtureModelClusteringUsingtheEMAlgorithmFindingclustersbymodelingthedatausingmixturemodelsandapplyingtheEMalgorithmtoestimatetheparametersofthosemodelshasavarietyofadvantagesanddisadvantages.Onthenegativeside,theEMalgorithmcanbeslow,itisnotpracticalformodelswithlargenumbersofcomponents,anditdoesnotworkwellwhenclusterscontainonlyafewdatapointsorifthedatapointsarenearlyco-linear.Thereisalsoaprobleminestimatingthenumberofclustersor,moregenerally,inchoosingtheexactformofthemodeltouse.ThisproblemtypicallyhasbeendealtwithbyapplyingaBayesianapproach,which,roughlyspeaking,givestheoddsofonemodelversusanother,basedonanestimatederivedfromthedata.Mixturemodelscanalsohavedifficultywithnoiseandoutliers,althoughworkhasbeendonetodealwiththisproblem.

Onthepositiveside,mixturemodelsaremoregeneralthanK-meansorfuzzyc-meansbecausetheycanusedistributionsofvarioustypes.Asaresult,mixturemodels(basedonGaussiandistributions)canfindclustersofdifferentsizesandellipticalshapes.Also,amodel-basedapproachprovidesadisciplinedwayofeliminatingsomeofthecomplexityassociatedwithdata.Toseethepatternsindata,itisoftennecessarytosimplifythedata,andfittingthedatatoamodelisagoodwaytodothatifthemodelisagoodmatchforthedata.Furthermore,itiseasytocharacterizetheclustersproduced,becausetheycanbedescribedbyasmallnumberofparameters.Finally,manysetsofdataareindeedtheresultofrandomprocesses,andthusshouldsatisfythestatisticalassumptionsofthesemodels.

8.2.3Self-OrganizingMaps(SOM)

TheKohonenSelf-OrganizingFeatureMap(SOFMorSOM)isaclusteringanddatavisualizationtechniquebasedonaneuralnetworkviewpoint.DespitetheneuralnetworkoriginsofSOM,itismoreeasilypresented—atleastinthecontextofthischapter—asavariationofprototype-basedclustering.Aswithothertypesofcentroid-basedclustering,thegoalofSOMistofindasetofcentroids(referencevectorsinSOMterminology)andtoassigneachobjectinthedatasettothecentroidthatprovidesthebestapproximationofthatobject.Inneuralnetworkterminology,thereisoneneuronassociatedwitheachcentroid.

AswithincrementalK-means,dataobjectsareprocessedoneatatimeandtheclosestcentroidisupdated.UnlikeK-means,SOMimposesatopographicorderingonthecentroidsandnearbycentroidsarealsoupdated.Furthermore,SOMdoesnotkeeptrackofthecurrentclustermembershipofanobject,and,unlikeK-means,ifanobjectswitchesclusters,thereisnoexplicitupdateoftheoldclustercentroid.However,iftheoldclusterisintheneighborhoodofthenewcluster,itwillbeupdated.Theprocessingofpointscontinuesuntilsomepredeterminedlimitisreachedorthecentroidsarenotchangingverymuch.ThefinaloutputoftheSOMtechniqueisasetofcentroidsthatimplicitlydefineclusters.Eachclusterconsistsofthepointsclosesttoaparticularcentroid.Thefollowingsectionexploresthedetailsofthisprocess.

TheSOMAlgorithmAdistinguishingfeatureofSOMisthatitimposesatopographic(spatial)organizationonthecentroids(neurons).Figure8.7 showsanexampleofatwo-dimensionalSOMinwhichthecentroidsarerepresentedbynodesthat

areorganizedinarectangularlattice.Eachcentroidisassignedapairofcoordinates(i,j).Sometimes,suchanetworkisdrawnwithlinksbetweenadjacentnodes,butthatcanbemisleadingbecausetheinfluenceofonecentroidonanotherisviaaneighborhoodthatisdefinedintermsofcoordinates,notlinks.TherearemanytypesofSOMneuralnetworks,butwerestrictourdiscussiontotwo-dimensionalSOMswitharectangularorhexagonalorganizationofthecentroids.

Figure8.7.Two-dimensional3-by-3rectangularSOMneuralnetwork.

EventhoughSOMissimilartoK-meansorotherprototype-basedapproaches,thereisafundamentaldifference.CentroidsusedinSOMhaveapredeterminedtopographicorderingrelationship.Duringthetrainingprocess,SOMuseseachdatapointtoupdatetheclosestcentroidandcentroidsthatarenearbyinthetopographicordering.Inthisway,SOMproducesanorderedsetofcentroidsforanygivendataset.Inotherwords,thecentroidsthatareclosetoeachotherintheSOMgridaremorecloselyrelatedtoeachotherthantothecentroidsthatarefartheraway.Becauseofthisconstraint,thecentroidsofatwo-dimensionalSOMcanbeviewedaslyingonatwo-dimensionalsurfacethattriestofitthen-dimensionaldataaswellaspossible.TheSOMcentroidscanalsobethoughtofastheresultofanonlinearregressionwithrespecttothedatapoints.

Atahighlevel,clusteringusingtheSOMtechniqueconsistsofthestepsdescribedinAlgorithm8.3 .

Algorithm 8.3 Basic SOM Algorithm.

1: Initialize the centroids.
2: repeat
3:   Select the next object.
4:   Determine the closest centroid to the object.
5:   Update this centroid and the centroids that are close, i.e., in a specified neighborhood.
6: until The centroids don't change much or a threshold is exceeded.
7: Assign each object to its closest centroid and return the centroids and clusters.

Initialization

This step (line 1) can be performed in a number of ways. One approach is to choose each component of a centroid randomly from the range of values observed in the data for that component. While this approach works, it is not necessarily the best approach, especially for producing rapid convergence. Another approach is to randomly choose the initial centroids from the available data points. This is very much like randomly selecting centroids for K-means.

Selection of an Object

The first step in the loop (line 3) is the selection of the next object. This is fairly straightforward, but there are some difficulties. Because convergence can

requiremanysteps,eachdataobjectmaybeusedmultipletimes,especiallyifthenumberofobjectsissmall.However,ifthenumberofobjectsislarge,thennoteveryobjectneedstobeused.Itisalsopossibletoenhancetheinfluenceofcertaingroupsofobjectsbyincreasingtheirfrequencyinthetrainingset.

Assignment

Thedeterminationoftheclosestcentroid(line4)isalsorelativelystraightforward,althoughitrequiresthespecificationofadistancemetric.TheEuclideandistancemetricisoftenused,asisthedotproductmetric.Whenusingthedotproductdistance,thedatavectorsaretypicallynormalizedbeforehandandthereferencevectorsarenormalizedateachstep.Insuchcases,usingthedotproductmetricisequivalenttousingthecosinemeasure.

Update

The update step (line 5) is the most complicated. Let m_1, …, m_k be the centroids. (For a rectangular grid, note that k is the product of the number of rows and the number of columns.) For time step t, let p(t) be the current object (point) and assume that the closest centroid to p(t) is m_j. Then, for time t + 1, the jth centroid is updated by using the following equation. (We will see shortly that the update is really restricted to centroids whose neurons are in a small neighborhood of m_j.)

m_j(t+1) = m_j(t) + h_j(t)\,(p(t) - m_j(t))    (8.16)

Thus, at time t, a centroid m_j(t) is updated by adding a term, h_j(t)(p(t) − m_j(t)), which is proportional to the difference, p(t) − m_j(t), between the current object, p(t), and centroid, m_j(t). h_j(t) determines the effect that the difference, p(t) − m_j(t), will have and is chosen so that (1) it diminishes with time and (2) it enforces a neighborhood effect, i.e., the effect of an object is strongest on the centroids closest to the centroid m_j. Here we are referring to the distance in the grid, not the distance in the data space. Typically, h_j(t) is chosen to be one of the following two functions:

h_j(t) = \alpha(t)\, \exp\left(-dist(r_j, r_k)^2 / (2\sigma^2(t))\right)    (Gaussian function)
h_j(t) = \alpha(t)\ \text{if}\ dist(r_j, r_k) \le threshold    (step function)

These functions require more explanation. α(t) is a learning rate parameter, 0 < α(t) < 1, which decreases monotonically with time and controls the rate of convergence. r_k = (x_k, y_k) is the two-dimensional point that gives the grid coordinates of the kth centroid. dist(r_j, r_k) is the Euclidean distance between the grid locations of the two centroids, i.e., \sqrt{(x_j - x_k)^2 + (y_j - y_k)^2}. Consequently, for centroids whose grid locations are far from the grid location of centroid m_j, the influence of object p(t) will be either greatly diminished or non-existent. Finally, note that σ is the typical Gaussian variance parameter and controls the width of the neighborhood, i.e., a small σ will yield a small neighborhood, while a large σ will yield a wide neighborhood. The threshold used for the step function also controls the neighborhood size.

Remember, it is the neighborhood updating technique that enforces a relationship (ordering) between centroids associated with neighboring neurons.

Termination

Deciding when we are close enough to a stable set of centroids is an important issue. Ideally, iteration should continue until convergence occurs, that is, until the reference vectors either do not change or change very little. The rate of convergence will depend on a number of factors, such as the data and α(t). We will not discuss these issues further, except to mention that, in general, convergence can be slow and is not guaranteed.
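A bare-bones sketch (ours) of Algorithm 8.3 for a rectangular grid, using Equation 8.16 with the Gaussian neighborhood function; the schedules chosen for α(t) and σ(t) are simple illustrative choices, not the only possibilities:

import numpy as np

def som_train(X, rows=3, cols=3, n_steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    k = rows * cols
    # grid coordinates r_j of each centroid (neuron) and random initial reference vectors
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for t in range(n_steps):
        alpha = 0.5 * (1.0 - t / n_steps)                 # learning rate, decreasing with time
        sigma = 1.0 + 2.0 * (1.0 - t / n_steps)           # neighborhood width, also shrinking
        p = X[rng.integers(len(X))]                       # select the next object (line 3)
        winner = np.argmin(((centroids - p) ** 2).sum(axis=1))    # closest centroid (line 4)
        grid_dist2 = ((grid - grid[winner]) ** 2).sum(axis=1)     # dist(r_j, r_k)^2 in the grid
        h = alpha * np.exp(-grid_dist2 / (2 * sigma ** 2))        # Gaussian neighborhood function
        centroids += h[:, None] * (p - centroids)                 # Equation 8.16 for every centroid
    return centroids

Centroids whose grid locations are far from the winner receive a negligible update because h_j(t) decays with grid distance, which is the neighborhood effect described above.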

Example8.6(DocumentData).Wepresenttwoexamples.Inthefirstcase,weapplySOMwitha4-by-4hexagonalgridtodocumentdata.Weclustered3204newspaperarticlesfromtheLosAngelesTimes,whichcomefrom6differentsections:Entertainment,Financial,Foreign,Metro,National,andSports.Figure8.8 showstheSOMgrid.Wehaveusedahexagonalgrid,whichallowseachcentroidtohavesiximmediateneighborsinsteadoffour.EachSOMgridcell(cluster)hasbeenlabeledwiththemajorityclasslabeloftheassociatedpoints.Theclustersofeachparticularcategoryformcontiguousgroups,andtheirpositionrelativetoothercategoriesofclustersgivesusadditionalinformation,e.g.,thattheMetrosectioncontainsstoriesrelatedtoallothersections.

Figure8.8.VisualizationoftherelationshipsbetweenSOMclusterforLosAngelesTimesdocumentdataset.

Example8.7(Two-DimensionalPoints).Inthesecondcase,weusearectangularSOMandasetoftwo-dimensionaldatapoints.Figure8.9(a) showsthepointsandthepositionsofthe36referencevectors(shownasx’s)producedbySOM.Thepointsarearrangedinacheckerboardpatternandaresplitintofiveclasses:circles,triangles,squares,diamonds,andhexagons(stars).A6-by-6two-dimensionalrectangulargridofcentroidswasusedwithrandominitialization.AsFigure8.9(a) shows,thecentroidstendtodistributethemselvestothedenseareas.Figure8.9(b) indicatesthemajorityclassofthepointsassociatedwiththatcentroid.Theclustersassociatedwithtrianglepointsareinonecontiguousarea,asarethecentroidsassociatedwiththefourothertypesofpoints.ThisisaresultoftheneighborhoodconstraintsenforcedbySOM.Whiletherearethesamenumberofpointsineachofthefivegroups,noticealsothatthecentroidsarenotevenlydistributed.Thisispartlyduetotheoveralldistributionofpointsandpartlyanartifactofputtingeachcentroidinasinglecluster.

Figure8.9.SOMappliedtotwo-dimensionaldatapoints.

ApplicationsOncetheSOMvectorsarefound,theycanbeusedformanypurposesotherthanclustering.Forexample,withatwo-dimensionalSOM,itispossibletoassociatevariousquantitieswiththegridpointsassociatedwitheachcentroid(cluster)andtovisualizetheresultsviavarioustypesofplots.Forexample,plottingthenumberofpointsassociatedwitheachclusteryieldsaplotthatrevealsthedistributionofpointsamongclusters.Atwo-dimensionalSOMisanonlinearprojectionoftheoriginalprobabilitydistributionfunctionintotwodimensions.Thisprojectionattemptstopreservetopologicalfeatures;thus,usingSOMtocapturethestructureofthedatahasbeencomparedtotheprocessof“pressingaflower.”

StrengthsandLimitationsSOMisaclusteringtechniquethatenforcesneighborhoodrelationshipsontheresultingclustercentroids.Becauseofthis,clustersthatareneighborsaremorerelatedtooneanotherthanclustersthatarenot.Suchrelationshipsfacilitatetheinterpretationandvisualizationoftheclusteringresults.Indeed,thisaspectofSOMhasbeenexploitedinmanyareas,suchasvisualizingwebdocumentsorgenearraydata.

SOMalsohasanumberoflimitations,whicharelistednext.SomeofthelistedlimitationsareonlyvalidifweconsiderSOMtobeastandardclusteringtechniquethataimstofindthetrueclustersinthedata,ratherthanatechniquethatusesclusteringtohelpdiscoverthestructureofthedata.Also,

someoftheselimitationshavebeenaddressedeitherbyextensionsofSOMorbyclusteringalgorithmsinspiredbySOM.(SeetheBibliographicNotes.)

Theusermustchoosethesettingsofparameters,theneighborhoodfunction,thegridtype,andthenumberofcentroids.ASOMclusteroftendoesnotcorrespondtoasinglenaturalcluster.Insomecases,aSOMclustermightencompassseveralnaturalclusters,whileinothercasesasinglenaturalclusterissplitintoseveralSOMclusters.ThisproblemispartlyduetotheuseofagridofcentroidsandpartlyduetothefactthatSOM,likeotherprototype-basedclusteringtechniques,tendstosplitorcombinenaturalclusterswhentheyareofvaryingsizes,shapes,anddensities.SOMlacksaspecificobjectivefunction.SOMattemptstofindasetofcentroidsthatbestapproximatethedata,subjecttothetopographicconstraintsamongthecentroids,butthesuccessofSOMindoingthiscannotbeexpressedbyafunction.ThiscanmakeitdifficulttocomparedifferentSOMclusteringresults.SOMisnotguaranteedtoconverge,although,inpractice,ittypicallydoes.

8.3 Density-Based Clustering In Section 7.4, we considered DBSCAN, a simple, but effective algorithm for finding density-based clusters, i.e., dense regions of objects that are surrounded by low-density regions. This section examines additional density-based clustering techniques that address issues of efficiency, finding clusters in subspaces, and more accurately modeling density. First, we consider grid-based clustering, which breaks the data space into grid cells and then forms clusters from cells that are sufficiently dense. Such an approach can be efficient and effective, at least for low-dimensional data. Next, we consider subspace clustering, which looks for clusters (dense regions) in subsets of all dimensions. For a data space with n dimensions, potentially 2^n − 1 subspaces need to be searched, and thus an efficient technique is needed to do this. CLIQUE is a grid-based clustering algorithm that provides an efficient approach to subspace clustering based on the observation that dense areas in a high-dimensional space imply the existence of dense areas in lower-dimensional space. Finally, we describe DENCLUE, a clustering technique that uses kernel density functions to model density as the sum of the influences of individual data objects. While DENCLUE is not fundamentally a grid-based technique, it does employ a grid-based approach to improve efficiency.

8.3.1Grid-BasedClustering

Agridisanefficientwaytoorganizeasetofdata,atleastinlowdimensions.Theideaistosplitthepossiblevaluesofeachattributeintoanumberofcontiguousintervals,creatingasetofgridcells.(Weareassuming,forthis


discussionandtheremainderofthesection,thatourattributesareordinal,interval,orcontinuous.)Eachobjectfallsintoagridcellwhosecorrespondingattributeintervalscontainthevaluesoftheobject.Objectscanbeassignedtogridcellsinonepassthroughthedata,andinformationabouteachcell,suchasthenumberofpointsinthecell,canalsobegatheredatthesametime.

Thereareanumberofwaystoperformclusteringusingagrid,butmostapproachesarebasedondensity,atleastinpart,andthus,inthissection,wewillusegrid-basedclusteringtomeandensity-basedclusteringusingagrid.Algorithm8.4 describesabasicapproachtogrid-basedclustering.Variousaspectsofthisapproachareexplorednext.

Algorithm 8.4 Basic grid-based clustering algorithm.

1: Define a set of grid cells.
2: Assign objects to the appropriate cells and compute the density of each cell.
3: Eliminate cells having a density below a specified threshold, τ.
4: Form clusters from contiguous (adjacent) groups of dense cells.

Defining Grid Cells
This is a key step in the process, but also the least well defined, as there are many ways to split the possible values of each attribute into a number of contiguous intervals. For continuous attributes, one common approach is to split the values into equal width intervals. If this approach is applied to each attribute, then the resulting grid cells all have the same volume, and the density of a cell is conveniently defined as the number of points in the cell.

However,moresophisticatedapproachescanalsobeused.Inparticular,forcontinuousattributesanyofthetechniquesthatarecommonlyusedtodiscretizeattributescanbeapplied.(SeeSection2.3.6 .)Inadditiontotheequalwidthapproachalreadymentioned,thisincludes(1)breakingthevaluesofanattributeintointervalssothateachintervalcontainsanequalnumberofpoints,i.e.,equalfrequencydiscretization,or(2)usingclustering.Anotherapproach,whichisusedbythesubspaceclusteringalgorithmMAFIA,initiallybreaksthesetofvaluesofanattributeintoalargenumberofequalwidthintervalsandthencombinesintervalsofsimilardensity.

Regardlessoftheapproachtaken,thedefinitionofthegridhasastrongimpactontheclusteringresults.Wewillconsiderspecificaspectsofthislater.

TheDensityofGridCellsAnaturalwaytodefinethedensityofagridcell(oramoregenerallyshapedregion)isasthenumberofpointsdividedbythevolumeoftheregion.Inotherwords,densityisthenumberofpointsperamountofspace,regardlessofthedimensionalityofthatspace.Specific,low-dimensionalexamplesofdensityarethenumberofroadsignspermile(onedimension),thenumberofeaglespersquarekilometerofhabitat(twodimensions),andthenumberofmoleculesofagaspercubiccentimeter(threedimensions).Asmentioned,however,acommonapproachistousegridcellsthathavethesamevolumesothatthenumberofpointspercellisadirectmeasureofthecell’sdensity.

Example8.8(Grid-BasedDensity).

Figure8.10 showstwosetsoftwo-dimensionalpointsdividedinto49cellsusinga7-by-7grid.Thefirstsetcontains200pointsgeneratedfromauniformdistributionoveracirclecenteredat(2,3)ofradius2,whilethesecondsethas100pointsgeneratedfromauniformdistributionoveracirclecenteredat(6,3)ofradius1.ThecountsforthegridcellsareshowninTable8.2 .Sincethecellshaveequalvolume(area),wecanconsiderthesevaluestobethedensitiesofthecells.

Figure8.10.Grid-baseddensity.

Table8.2.Pointcountsforgridcells.

0 0 0 0 0 0 0

0 0 0 0 0 0 0

4 17 18 6 0 0 0

14 14 13 13 0 18 27

11 18 10 21 0 24 31

3 20 14 4 0 0 0

0 0 0 0 0 0 0

FormingClustersfromDenseGridCellsFormingclustersfromadjacentgroupsofdensecellsisrelativelystraightforward.(InFigure8.10 ,forexample,itisclearthattherewouldbetwoclusters.)Thereare,however,someissues.Weneedtodefinewhatwemeanbyadjacentcells.Forexample,doesatwo-dimensionalgridcellhave4adjacentcellsor8?Also,weneedanefficienttechniquetofindtheadjacentcells,particularlywhenonlyoccupiedcellsarestored.
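A short sketch (ours) of Algorithm 8.4 for two-dimensional data: points are binned into equal-width cells, cells below the density threshold τ are discarded, and clusters are formed by joining dense cells that share an edge (a 4-neighbor notion of adjacency is assumed here):

import numpy as np
from collections import defaultdict, deque

def grid_cluster(X, n_bins=7, tau=9):
    # Assign each point to a grid cell (Algorithm 8.4, steps 1 and 2).
    mins, maxs = X.min(axis=0), X.max(axis=0)
    cells = np.minimum(((X - mins) / (maxs - mins) * n_bins).astype(int), n_bins - 1)
    counts = defaultdict(int)
    for c in map(tuple, cells):
        counts[c] += 1
    dense = {c for c, n in counts.items() if n >= tau}     # step 3: keep cells with density >= tau
    # Step 4: connect adjacent dense cells with a breadth-first search.
    labels, cluster_id = {}, 0
    for start in dense:
        if start in labels:
            continue
        cluster_id += 1
        queue = deque([start])
        labels[start] = cluster_id
        while queue:
            cx, cy = queue.popleft()
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in labels:
                    labels[nb] = cluster_id
                    queue.append(nb)
    # Label each point by its cell's cluster; 0 means the point fell in a discarded (sparse) cell.
    return np.array([labels.get(tuple(c), 0) for c in cells])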

TheclusteringapproachdefinedbyAlgorithm8.4 hassomelimitationsthatcouldbeaddressedbymakingthealgorithmslightlymoresophisticated.Forexample,therearelikelytobepartiallyemptycellsontheboundaryofacluster.Often,thesecellsarenotdense.Ifso,theywillbediscardedandpartsofaclusterwillbelost.Figure8.10 andTable8.2 showthatfourpartsofthelargerclusterwouldbelostifthedensitythresholdis9.Theclusteringprocesscouldbemodifiedtoavoiddiscardingsuchcells,althoughthiswouldrequireadditionalprocessing.

Itisalsopossibletoenhancebasicgrid-basedclusteringbyusingmorethanjustdensityinformation.Inmanycases,thedatahasbothspatialandnon-spatialattributes.Inotherwords,someoftheattributesdescribethelocationofobjectsintimeorspace,whileotherattributesdescribeotheraspectsoftheobjects.Acommonexampleishouses,whichhavebothalocationandanumberofothercharacteristics,suchaspriceorfloorspaceinsquarefeet.Becauseofspatial(ortemporal)autocorrelation,objectsinaparticularcelloftenhavesimilarvaluesfortheirotherattributes.Insuchcases,itispossibletofilterthecellsbasedonthestatisticalpropertiesofoneormorenon-spatial

attributes,e.g.,averagehouseprice,andthenformclustersbasedonthedensityoftheremainingpoints.

Strengths and Limitations On the positive side, grid-based clustering can be very efficient and effective. Given a partitioning of each attribute, a single pass through the data can determine the grid cell of every object and the count of every grid cell. Also, even though the number of potential grid cells can be high, grid cells need to be created only for non-empty cells. Thus, the time and space complexity of defining the grid, assigning each object to a cell, and computing the density of each cell is only O(m), where m is the number of points. If adjacent, occupied cells can be efficiently accessed, for example, by using a search tree, then the entire clustering process will be highly efficient, e.g., with a time complexity of O(m log m).

Forthisreason,thegrid-basedapproachtodensityclusteringformsthebasisofanumberofclusteringalgorithms,suchasSTING,GRIDCLUS,WaveCluster,Bang-Clustering,CLIQUE,andMAFIA.

Onthenegativeside,grid-basedclustering,likemostdensity-basedclusteringschemes,isverydependentonthechoiceofthedensitythresholdτ.Ifτistoohigh,thenclusterswillbelost.Ifτistoolow,twoclustersthatshouldbeseparatemaybejoined.Furthermore,ifthereareclustersandnoiseofdifferingdensities,thenitmightnotbepossibletofindasinglevalueofτthatworksforallpartsofthedataspace.

Therearealsoanumberofissuesrelatedtothegrid-basedapproach.InFigure8.10 ,forexample,therectangulargridcellsdonotaccuratelycapturethedensityofthecircularboundaryareas.Wecouldattempttoalleviatethisproblembymakingthegridfiner,butthenumberofpointsinthegridcellsassociatedwithaclusterwouldlikelyshowmorefluctuationbecausepointsintheclusterarenotevenlydistributed.Indeed,somegridcells,


includingthoseintheinteriorofthecluster,mightevenbeempty.Anotherissueisthat,dependingontheplacementorsizeofthecells,agroupofpointscanappearinjustonecellorbesplitbetweenseveraldifferentcells.Thesamegroupofpointsmightbepartofaclusterinthefirstcase,butbediscardedinthesecond.Finally,asdimensionalityincreases,thenumberofpotentialgridcellsincreasesrapidly—exponentiallyinthenumberofdimensions.Eventhoughitisnotnecessarytoexplicitlyconsideremptygridcells,itcaneasilyhappenthatmostgridcellscontainasingleobject.Inotherwords,grid-basedclusteringtendstoworkpoorlyforhigh-dimensionaldata.

8.3.2SubspaceClustering

Theclusteringtechniquesconsidereduntilnowfoundclustersbyusingalloftheattributes.However,ifonlysubsetsofthefeaturesareconsidered,i.e.,subspacesofthedata,thentheclustersthatwefindcanbequitedifferentfromonesubspacetoanother.Therearetworeasonsthatsubspaceclustersmightbeinteresting.First,thedatamaybeclusteredwithrespecttoasmallsetofattributes,butrandomlydistributedwithrespecttotheremainingattributes.Second,therearecasesinwhichdifferentclustersexistindifferentsetsofdimensions.Consideradatasetthatrecordsthesalesofvariousitemsatvarioustimes.(Thetimesarethedimensionsandtheitemsaretheobjects.)Someitemsmightshowsimilarbehavior(clustertogether)forparticularsetsofmonths,e.g.,summer,butdifferentclusterswouldlikelybecharacterizedbydifferentmonths(dimensions).

Example 8.9 (Subspace Clusters). Figure 8.11(a) shows a set of points in three-dimensional space. There are three clusters of points in the full space, which are represented by squares, diamonds, and triangles. In addition, there is one set of points, represented by circles, that is not a cluster in three-dimensional space. Each dimension (attribute) of the example data set is split into a fixed number (η) of equal width intervals. There are η = 20 intervals, each of size 0.1. This partitions the data space into rectangular cells of equal volume, and thus, the density of each unit is the fraction of points it contains. Clusters are contiguous groups of dense cells. To illustrate, if the threshold for a dense cell is ξ = 0.06, or 6% of the points, then three one-dimensional clusters can be identified in Figure 8.12, which shows a histogram of the data points of Figure 8.11(a) for the x attribute.

Figure8.11.Examplefiguresforsubspaceclustering.

Figure8.12.Histogramshowingthedistributionofpointsforthexattribute.

Figure 8.11(b) shows the points plotted in the xy plane. (The z attribute is ignored.) This figure also contains histograms along the x and y axes that show the distribution of the points with respect to their x and y coordinates, respectively. (A higher bar indicates that the corresponding interval contains relatively more points, and vice versa.) When we consider the y axis, we see three clusters. One is from the circle points that do not form a cluster in the full space, one consists of the square points, and one consists of the diamond and triangle points. There are also three clusters in the x dimension; they correspond to the three clusters (diamonds, triangles, and squares) in the full space. These points also form distinct clusters in the xy plane. Figure 8.11(c) shows the points plotted in the xz plane. There are two clusters, if we consider only the z attribute. One cluster corresponds to the points represented by circles, while the other consists of the diamond, triangle, and square points. These points also form distinct clusters in the xz plane. In Figure 8.11(d), there are three clusters when we consider both the y and z coordinates. One of these clusters consists of the circles; another consists of the points marked by squares. The diamonds and triangles form a single cluster in the yz plane.

Thesefiguresillustrateacoupleofimportantfacts.First,asetofpoints—thecircles—maynotformaclusterintheentiredataspace,butmayformaclusterinasubspace.Second,clustersthatexistinthefulldataspace(orevenasubspace)showupasclustersinlower-dimensionalspaces.Thefirstfacttellsusthatweneedtolookinsubsetsofdimensionstofindclusters,whilethesecondfacttellsusthatmanyoftheclusterswefindinsubspacesarelikelytobe“shadows”(projections)ofhigher-dimensionalclusters.Thegoalistofindtheclustersandthedimensionsinwhichtheyexist,butwearetypicallynotinterestedinclustersthatareprojectionsofhigher-dimensionalclusters.

CLIQUECLIQUE(CLusteringInQUEst)isagrid-basedclusteringalgorithmthatmethodicallyfindssubspaceclusters.Itisimpracticaltocheckeachsubspaceforclustersbecausethenumberofsuchsubspacesisexponentialinthenumberofdimensions.Instead,CLIQUEreliesonthefollowingproperty:

Monotonicitypropertyofdensity-basedclustersIfasetofpointsformsadensity-basedclusterinkdimensions(attributes),thenthesamesetofpointsisalsopartofadensity-basedclusterinallpossiblesubsetsofthosedimensions.

Consider a set of adjacent, k-dimensional cells that form a cluster; i.e., there is a collection of adjacent cells that have a density above the specified threshold ξ. A corresponding set of cells in k−1 dimensions can be found by omitting one of the k dimensions (attributes). The lower-dimensional cells are still adjacent, and each low-dimensional cell contains all points of the corresponding high-dimensional cell. It can contain additional points as well. Thus, a low-dimensional cell has a density greater than or equal to that of its corresponding high-dimensional cell. Consequently, the low-dimensional cells form a cluster; i.e., the points form a cluster with the reduced set of attributes.

Algorithm 8.5 gives a simplified version of the steps involved in CLIQUE. Conceptually, the CLIQUE algorithm is similar to the Apriori algorithm for finding frequent itemsets. See Chapter 5.

Algorithm 8.5 CLIQUE.
1: Find all the dense areas in the one-dimensional spaces corresponding to each attribute. This is the set of dense one-dimensional cells.
2: k ← 2
3: repeat
4: Generate all candidate dense k-dimensional cells from dense (k−1)-dimensional cells.
5: Eliminate cells that have fewer than ξ points.
6: k ← k + 1
7: until There are no candidate dense k-dimensional cells.
8: Find clusters by taking the union of all adjacent, high-density cells.
9: Summarize each cluster using a small set of inequalities that describe the attribute ranges of the cells in the cluster.
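The Apriori-like candidate generation of steps 3-7 can be sketched roughly as follows. This is a simplified illustration, not CLIQUE's actual implementation: the helper names are assumptions, points are assumed to be scaled to the unit interval, and the monotonicity property is only used to skip subspaces whose lower-dimensional projections contain no dense cells; forming clusters from adjacent dense units and summarizing them are left out.

import numpy as np
from itertools import combinations
from collections import Counter

def dense_cells(points, dims, n_intervals, xi):
    """Count occupancy of cells restricted to the attribute subset `dims`
    and keep the cells containing at least `xi` points."""
    idx = np.floor(points[:, dims] * n_intervals).astype(int)
    idx = np.minimum(idx, n_intervals - 1)
    counts = Counter(map(tuple, idx))
    return {cell for cell, n in counts.items() if n >= xi}

def clique_dense_units(points, n_intervals=20, xi=6):
    """Apriori-style search for dense units, one subspace dimensionality at a time.
    Returns a dict {subspace (tuple of dims): dense cells in that subspace}."""
    d = points.shape[1]
    dense = {(j,): dense_cells(points, [j], n_intervals, xi) for j in range(d)}
    k = 2
    while True:
        candidates = {}
        # Candidate k-dim subspaces must have dense cells in every (k-1)-dim projection.
        for dims in combinations(range(d), k):
            if all(sub in dense and dense[sub] for sub in combinations(dims, k - 1)):
                cells = dense_cells(points, list(dims), n_intervals, xi)
                if cells:
                    candidates[dims] = cells
        if not candidates:
            break
        dense.update(candidates)
        k += 1
    return dense

rng = np.random.default_rng(1)
X = rng.random((400, 3))
X[:150, 0] = rng.normal(0.3, 0.02, 150)   # a cluster that exists only in attribute 0
print({s: len(c) for s, c in clique_dense_units(X, xi=24).items() if c})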

StrengthsandLimitationsofCLIQUEThemostusefulfeatureofCLIQUEisthatitprovidesanefficienttechniqueforsearchingsubspacesforclusters.Sincethisapproachisbasedonthewell-knownAprioriprinciplefromassociationanalysis,itspropertiesarewellunderstood.AnotherusefulfeatureisCLIQUE’sabilitytosummarizethelistofcellsthatcomprisesaclusterwithasmallsetofinequalities.

ManylimitationsofCLIQUEareidenticaltothepreviouslydiscussedlimitationsofothergrid-baseddensityschemes.OtherlimitationsaresimilartothoseoftheApriorialgorithm.Specifically,justasfrequentitemsetscanshareitems,theclustersfoundbyCLIQUEcanshareobjects.Allowingclusterstooverlapcangreatlyincreasethenumberofclustersandmakeinterpretationdifficult.AnotherissueisthatApriori—likeCLIQUE—potentiallyhasexponentialtimecomplexity.Inparticular,CLIQUEwillhavedifficultyiftoomanydensecellsaregeneratedatlowervaluesofk.Raisingthedensitythresholdξcanalleviatethisproblem.StillanotherpotentiallimitationofCLIQUEisexploredinExercise20 onpage702.

8.3.3DENCLUE:AKernel-BasedSchemeforDensity-BasedClustering

DENCLUE(DENsityCLUstEring)isadensity-basedclusteringapproachthatmodelstheoveralldensityofasetofpointsasthesumofinfluencefunctionsassociatedwitheachpoint.Theresultingoveralldensityfunctionwillhavelocalpeaks,i.e.,localdensitymaxima,andtheselocalpeakscanbeusedtodefineclustersinanaturalway.Specifically,foreachdatapoint,ahill-climbingprocedurefindsthenearestpeakassociatedwiththatpoint,andthesetofall

datapointsassociatedwithaparticularpeak(calledalocaldensityattractor)becomesacluster.However,ifthedensityatalocalpeakistoolow,thenthepointsintheassociatedclusterareclassifiedasnoiseanddiscarded.Also,ifalocalpeakcanbeconnectedtoasecondlocalpeakbyapathofdatapoints,andthedensityateachpointonthepathisabovetheminimumdensitythreshold,thentheclustersassociatedwiththeselocalpeaksaremerged.Therefore,clustersofanyshapecanbediscovered.

Example8.10(DENCLUEDensity).WeillustratetheseconceptswithFigure8.13 ,whichshowsapossibledensityfunctionforaone-dimensionaldataset.PointsA–Earethepeaksofthisdensityfunctionandrepresentlocaldensityattractors.Thedottedverticallinesdelineatelocalregionsofinfluenceforthelocaldensityattractors.Pointsintheseregionswillbecomecenter-definedclusters.Thedashedhorizontallineshowsadensitythreshold,ξ.Allpointsassociatedwithalocaldensityattractorthathasadensitylessthanξ,suchasthoseassociatedwithC,willbediscarded.Allotherclustersarekept.Notethatthiscanincludepointswhosedensityislessthanξ,aslongastheyareassociatedwithlocaldensityattractorswhosedensityisgreaterthanξ.Finally,clustersthatareconnectedbyapathofpointswithadensityaboveξarecombined.ClustersAandBwouldremainseparate,whileclustersDandEwouldbecombined.


Figure8.13.IllustrationofDENCLUEdensityconceptsinonedimension.

Thehigh-leveldetailsoftheDENCLUEalgorithmaresummarizedinAlgorithm8.6 .Next,weexplorevariousaspectsofDENCLUEinmoredetail.First,weprovideabriefoverviewofkerneldensityestimationandthenpresentthegrid-basedapproachthatDENCLUEusesforapproximatingthedensity.

Algorithm 8.6 DENCLUE algorithm.
1: Derive a density function for the space occupied by the data points.
2: Identify the points that are local maxima. (These are the density attractors.)
3: Associate each point with a density attractor by moving in the direction of maximum increase in density.
4: Define clusters consisting of points associated with a particular density attractor.
5: Discard clusters whose density attractor has a density less than a user-specified threshold of ξ.
6: Combine clusters that are connected by a path of points that all have a density of ξ or higher.

Kernel Density Estimation DENCLUE is based on a well-developed area of statistics and pattern recognition that is known as kernel density estimation. The goal of this collection of techniques (and many other statistical techniques as well) is to describe the distribution of the data by a function. For kernel density estimation, the contribution of each point to the overall density function is expressed by an influence or kernel function. The overall density function is simply the sum of the influence functions associated with each point.

Typically, the influence or kernel function is symmetric (the same in all directions) and its value (contribution) decreases as the distance from the point increases. For example, for a particular point, x, the Gaussian function,

K(y) = e^{-\mathrm{distance}(x, y)^2 / 2\sigma^2},

is often used as a kernel function. σ is a parameter, analogous to standard deviation, which governs how quickly the influence of a point diminishes with distance. Figure 8.14(a) shows what a Gaussian density function would look like for a single point in two dimensions, while Figures 8.14(c) and 8.14(d) show the overall density function produced by applying the Gaussian influence function to the set of points shown in Figure 8.14(b).

Figure8.14.ExampleoftheGaussianinfluence(kernel)functionandanoveralldensityfunction.
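A minimal sketch of these ideas, assuming NumPy and ignoring DENCLUE's grid-based approximations (the step size, the rounding used to coalesce nearby attractors, and the density threshold are ad hoc choices, not part of the algorithm as described), computes the Gaussian kernel density and associates each point with a density attractor by simple gradient ascent.

import numpy as np

def density(x, points, sigma=0.3):
    """Overall density at x: sum of Gaussian influence functions of all points."""
    d2 = np.sum((points - x) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

def climb_to_attractor(x, points, sigma=0.3, step=0.05, iters=100):
    """Simple gradient ascent from x toward its local density maximum (attractor)."""
    x = x.copy()
    for _ in range(iters):
        diff = points - x
        w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * sigma ** 2))
        grad = (w[:, None] * diff).sum(axis=0) / sigma ** 2   # gradient of the density
        if np.linalg.norm(grad) < 1e-6:
            break
        x = x + step * grad / (np.linalg.norm(grad) + 1e-12)
    return np.round(x, 1)       # crude rounding so nearby attractors coincide

rng = np.random.default_rng(2)
pts = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(2, 0.2, (50, 2))])
attractors = {}
for i, p in enumerate(pts):
    attractors.setdefault(tuple(climb_to_attractor(p, pts)), []).append(i)
# Each attractor whose density exceeds a threshold xi defines a cluster.
clusters = {a: idx for a, idx in attractors.items() if density(np.array(a), pts) >= 5}
print({a: len(idx) for a, idx in clusters.items()})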

Implementation Issues Computation of kernel density can be quite expensive, and DENCLUE uses a number of approximations to implement its basic approach efficiently. First, it explicitly computes density only at data points. However, this still would result in an O(m^2) time complexity because the density at each point is a function of the density contributed by every point. To reduce the time complexity, DENCLUE uses a grid-based implementation to efficiently define neighborhoods and thus limit the number of points that need to be considered to define the density at a point. First, a preprocessing step creates a set of grid cells. Only occupied cells are created, and these cells and their related information can be efficiently accessed via a search tree. Then, when computing the density of a point and finding its nearest density attractor, DENCLUE considers only the points in the neighborhood; i.e., points in the same cell and in cells that are connected to the point's cell. While this approach can sacrifice some accuracy with respect to density estimation, computational complexity is greatly reduced.

StrengthsandLimitationsofDENCLUEDENCLUEhasasolidtheoreticalfoundationbecauseitisbasedontheconceptofkerneldensityestimation,whichisawell-developedareaofstatistics.Forthisreason,DENCLUEprovidesamoreflexibleandpotentiallymoreaccuratewaytocomputedensitythanothergrid-basedclusteringtechniquesandDBSCAN.(DBSCANisaspecialcaseofDENCLUE.)Anapproachbasedonkerneldensityfunctionsisinherentlycomputationallyexpensive,butDENCLUEemploysgrid-basedtechniquestoaddresssuchissues.Nonetheless,DENCLUEcanbemorecomputationallyexpensivethanotherdensity-basedclusteringtechniques.Also,theuseofagridcanadverselyaffecttheaccuracyofthedensityestimation,anditmakesDENCLUEsusceptibletoproblemscommontogrid-basedapproaches;e.g.,thedifficultyofchoosingthepropergridsize.Moregenerally,DENCLUEsharesmanyofthestrengthsandlimitationsofotherdensity-basedapproaches.Forinstance,DENCLUEisgoodathandlingnoiseandoutliersanditcanfindclustersofdifferentshapesandsize,butithastroublewithhigh-dimensionaldataanddatathatcontainsclustersofwidelydifferentdensities.

8.4Graph-BasedClusteringSection7.3 discussedanumberofclusteringtechniquesthattookagraph-basedviewofdata,inwhichdataobjectsarerepresentedbynodesandtheproximitybetweentwodataobjectsisrepresentedbytheweightoftheedgebetweenthecorrespondingnodes.Thissectionconsiderssomeadditionalgraph-basedclusteringalgorithmsthatuseanumberofkeypropertiesandcharacteristicsofgraphs.Thefollowingaresomekeyapproaches,differentsubsetsofwhichareemployedbythesealgorithms.

1. Sparsifytheproximitygraphtokeeponlytheconnectionsofanobjectwithitsnearestneighbors.Thissparsificationisusefulforhandlingnoiseandoutliers.Italsoallowstheuseofhighlyefficientgraphpartitioningalgorithmsthathavebeendevelopedforsparsegraphs.

2. Defineasimilaritymeasurebetweentwoobjectsbasedonthenumberofnearestneighborsthattheyshare.Thisapproach,whichisbasedontheobservationthatanobjectanditsnearestneighborsusuallybelongtothesameclass,isusefulforovercomingproblemswithhighdimensionalityandclustersofvaryingdensity.

3. Definecoreobjectsandbuildclustersaroundthem.Todothisforgraph-basedclustering,itisnecessarytointroduceanotionofdensity-basedonaproximitygraphorasparsifiedproximitygraph.AswithDBSCAN,buildingclustersaroundcoreobjectsleadstoaclusteringtechniquethatcanfindclustersofdifferingshapesandsizes.

4. Usetheinformationintheproximitygraphtoprovideamoresophisticatedevaluationofwhethertwoclustersshouldbemerged.Specifically,twoclustersaremergedonlyiftheresultingclusterwillhavecharacteristicssimilartotheoriginaltwoclusters.

Webeginbydiscussingthesparsificationofproximitygraphs,providingthreeexamplesoftechniqueswhoseapproachtoclusteringisbasedsolelyonthistechnique:MST,whichisequivalenttothesinglelinkclusteringalgorithm,Opossum,andspectralclustering.WethendiscussChameleon,ahierarchicalclusteringalgorithmthatusesanotionofself-similaritytodetermineifclustersshouldbemerged.WenextdefineSharedNearestNeighbor(SNN)similarity,anewsimilaritymeasure,andintroducetheJarvis-Patrickclusteringalgorithm,whichusesthissimilarity.Finally,wediscusshowtodefinedensityandcoreobjectsbasedonSNNsimilarityandintroduceanSNNdensity-basedclusteringalgorithm,whichcanbeviewedasDBSCANwithanewsimilaritymeasure.

8.4.1Sparsification

The m by m proximity matrix for m data points can be represented as a dense graph in which each node is connected to all others and the weight of the edge between any pair of nodes reflects their pairwise proximity. Although every object has some level of similarity to every other object, for most data sets, objects are highly similar to a small number of objects and weakly similar to most other objects. This property can be used to sparsify the proximity graph (matrix), by setting many of these low-similarity (high-dissimilarity) values to 0 before beginning the actual clustering process. The sparsification may be performed, for example, by breaking all links that have a similarity (dissimilarity) below (above) a specified threshold or by keeping only links to the k-nearest neighbors of a point. This latter approach creates what is called a k-nearest neighbor graph.

Sparsificationhasseveralbeneficialeffects:

Data size is reduced. The amount of data that needs to be processed to cluster the data is drastically reduced. Sparsification can often eliminate more than 99% of the entries in a proximity matrix. As a result, the size of problems that can be handled is increased.

Clustering often works better. Sparsification techniques keep an object's connections to its nearest neighbors while breaking the connections to more distant objects. This is in keeping with the nearest neighbor principle that the nearest neighbors of an object tend to belong to the same class (cluster) as the object itself. This reduces the impact of noise and outliers and sharpens the distinction between clusters.

Graph partitioning algorithms can be used. There has been a considerable amount of work on heuristic algorithms for finding mincut partitionings of sparse graphs, especially in the areas of parallel computing and the design of integrated circuits. Sparsification of the proximity graph makes it possible to use graph partitioning algorithms for the clustering process. For example, Opossum and Chameleon use graph partitioning.
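As a small illustration of k-nearest neighbor sparsification, the following sketch (plain NumPy; the function name is arbitrary) keeps, for each point, only the links to its k nearest neighbors in a dissimilarity matrix.

import numpy as np

def knn_sparsify(dist, k):
    """Keep, for each point, only the links to its k nearest neighbors.
    dist: (m, m) dissimilarity matrix. Returns a 0/1 adjacency matrix."""
    m = dist.shape[0]
    adj = np.zeros((m, m), dtype=int)
    for i in range(m):
        order = np.argsort(dist[i])
        neighbors = [j for j in order if j != i][:k]   # skip the point itself
        adj[i, neighbors] = 1
    return adj                                          # note: not necessarily symmetric

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(knn_sparsify(D, k=3))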

Sparsificationoftheproximitygraphshouldberegardedasaninitialstepbeforetheuseofactualclusteringalgorithms.Intheory,aperfectsparsificationcouldleavetheproximitymatrixsplitintoconnectedcomponentscorrespondingtothedesiredclusters,butinpractice,thisrarelyhappens.Itiseasyforasingleedgetolinktwoclustersorforasingleclustertobesplitintoseveraldisconnectedsubclusters.AsweshallseewhenwediscussJarvis-PatrickandSNNdensity-basedclustering,thesparseproximitygraphisoftenmodifiedtoyieldanewproximitygraph.Thisnewproximitygraphcanagainbesparsified.Clusteringalgorithmsworkwiththeproximitygraphthatistheresultofallthesepreprocessingsteps.ThisprocessissummarizedinFigure8.15 .

Figure8.15.Idealprocessofclusteringusingsparsification.

8.4.2MinimumSpanningTree(MST)Clustering

InSection7.3 ,wherewedescribedagglomerativehierarchicalclusteringtechniques,wementionedthatdivisivehierarchicalclusteringalgorithmsalsoexist.Wesawanexampleofonesuchtechnique,bisectingK-means,inSection7.2.3 .Anotherdivisivehierarchicaltechnique,MST,startswiththeminimumspanningtreeoftheproximitygraphandcanbeviewedasanapplicationofsparsificationforfindingclusters.Webrieflydescribethisalgorithm.Interestingly,thisalgorithmalsoproducesthesameclusteringassinglelinkagglomerativeclustering.SeeExercise13 onpage700.

Aminimumspanningtreeofagraphisasubgraphthat(1)hasnocycles,i.e.,isatree,(2)containsallthenodesofthegraph,and(3)hastheminimumtotaledgeweightofallpossiblespanningtrees.Theterminology,minimumspanningtree,assumesthatweareworkingonlywithdissimilaritiesordistances,andwewillfollowthisconvention.Thisisnotalimitation,however,sincewecanconvertsimilaritiestodissimilaritiesormodifythenotionofaminimumspanningtreetoworkwithsimilarities.Anexampleofaminimumspanningtreeforsometwo-dimensionalpointsisshowninFigure8.16 .

Figure8.16.Minimumspanningtreeforasetofsixtwo-dimensionalpoints.

TheMSTdivisivehierarchicalalgorithmisshowninAlgorithm8.7 .ThefirststepistofindtheMSToftheoriginaldissimilaritygraph.Notethataminimumspanningtreecanbeviewedasaspecialtypeofsparsifiedgraph.Step3canalsobeviewedasgraphsparsification.Hence,MSTcanbeviewedasaclusteringalgorithmbasedonthesparsificationofthedissimilaritygraph.

Algorithm 8.7 MST divisive hierarchical clustering algorithm.
1: Compute a minimum spanning tree for the dissimilarity graph.
2: repeat
3: Create a new cluster by breaking the link corresponding to the largest dissimilarity.
4: until Only singleton clusters remain.
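Algorithm 8.7 can be sketched with SciPy's minimum spanning tree routine: build the MST of the dissimilarity matrix and break the largest edges until the desired number of connected components remains. The function below is an illustrative stand-in, not the book's implementation; stopping at a chosen number of clusters rather than at singletons is a convenience for the example.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(dist, n_clusters):
    """Divisive MST clustering: build the MST of the dissimilarity graph,
    then break the largest edges until n_clusters components remain."""
    mst = minimum_spanning_tree(dist).toarray()        # MST edge weights (upper triangular)
    edges = np.argwhere(mst > 0)
    weights = mst[mst > 0]
    # Break the (n_clusters - 1) largest edges.
    for idx in np.argsort(weights)[::-1][: n_clusters - 1]:
        i, j = edges[idx]
        mst[i, j] = 0
    _, labels = connected_components(mst, directed=False)
    return labels

pts = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(mst_clusters(D, n_clusters=2))   # two groups of three points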

8.4.3OPOSSUM:OptimalPartitioningofSparseSimilaritiesUsingMETIS

OPOSSUMisaclusteringtechniqueforclusteringsparse,high-dimensionaldata,e.g.,documentormarketbasketdata.LikeMST,itperformsclusteringbasedonthesparsificationofaproximitygraph.However,OPOSSUMusestheMETISalgorithm,whichwasspecificallycreatedforpartitioningsparsegraphs.ThestepsofOPOSSUMaregiveninAlgorithm8.8 .

Thesimilaritymeasuresusedarethoseappropriateforsparse,high-dimensionaldata,suchastheextendedJaccardmeasureorthecosinemeasure.TheMETISgraphpartitioningprogrampartitionsasparsegraphintokdistinctcomponents,wherekisauser-specifiedparameter,inorderto(1)minimizetheweightoftheedges(thesimilarity)betweencomponentsand(2)fulfillabalanceconstraint.OPOSSUMusesoneofthefollowingtwobalanceconstraints:(1)thenumberofobjectsineachclustermustberoughlythesame,or(2)thesumoftheattributevaluesmustberoughlythesame.Thesecondconstraintisusefulwhen,forexample,theattributevaluesrepresentthecostofanitem.

Algorithm 8.8 OPOSSUM clustering algorithm.
1: Compute a sparsified similarity graph.
2: Partition the similarity graph into k distinct components (clusters) using METIS.

StrengthsandWeaknessesOPOSSUMissimpleandfast.Itpartitionsthedataintoroughlyequal-sizedclusters,which,dependingonthegoaloftheclustering,canbeviewedasanadvantageoradisadvantage.Becausetheyareconstrainedtobeofroughlyequalsize,clusterscanbebrokenorcombined.However,ifOPOSSUMisusedtogeneratealargenumberofclusters,thentheseclustersaretypicallyrelativelypurepiecesoflargerclusters.Indeed,OPOSSUMissimilartotheinitialstepoftheChameleonclusteringroutine,whichisdiscussednext.

8.4.4Chameleon:HierarchicalClusteringwithDynamicModeling

Agglomerative hierarchical clustering techniques operate by merging the two most similar clusters, where the definition of cluster similarity depends on the particular algorithm. Some agglomerative algorithms, such as group average, base their notion of similarity on the strength of the connections between the two clusters (e.g., the pairwise similarity of points in the two clusters), while other techniques, such as the single link method, use the closeness of the clusters (e.g., the minimum distance between points in different clusters) to measure cluster similarity. Although there are two basic approaches, using only one of these two approaches can lead to mistakes in merging clusters. Consider Figure 8.17, which shows four clusters. If we use the closeness of clusters (as measured by the closest two points in different clusters) as our merging criterion, then we would merge the two circular clusters, (c) and (d), which almost touch, instead of the rectangular clusters, (a) and (b), which are separated by a small gap. However, intuitively, we should have merged rectangular clusters, (a) and (b). Exercise 15 on page 700 asks for an example of a situation in which the strength of connections likewise leads to an unintuitive result.

Figure8.17.Situationinwhichclosenessisnottheappropriatemergingcriterion.©1999,IEEE

Anotherproblemisthatmostclusteringtechniqueshaveaglobal(static)modelofclusters.Forinstance,K-meansassumesthattheclusterswillbeglobular,whileDBSCANdefinesclustersbasedonasingledensitythreshold.Clusteringschemesthatusesuchaglobalmodelcannothandlecasesinwhichclustercharacteristics,suchassize,shape,anddensity,varywidelybetweenclusters.Asanexampleoftheimportanceofthelocal(dynamic)modelingofclusters,considerFigure8.18 .Ifweusetheclosenessofclusterstodeterminewhichpairofclustersshouldbemerged,aswouldbethecaseifweused,forexample,thesinglelinkclusteringalgorithm,thenwewouldmergeclusters(a)and(b).However,wehavenottakenintoaccountthecharacteristicsofeachindividualcluster.Specifically,wehaveignoredthedensityoftheindividualclusters.Forclusters(a)and(b),whicharerelativelydense,thedistancebetweenthetwoclustersissignificantlylargerthanthedistancebetweenapointanditsnearestneighborswithinthesamecluster.

Thisisnotthecaseforclusters(c)and(d),whicharerelativelysparse.Indeed,whenclusters(c)and(d)aremerged,theyyieldaclusterthatseemsmoresimilartotheoriginalclustersthantheclusterthatresultsfrommergingclusters(a)and(b).

Figure8.18.Illustrationofthenotionofrelativecloseness.©1999,IEEE

Chameleonisanagglomerativeclusteringalgorithmthataddressestheissuesoftheprevioustwoparagraphs.Itcombinesaninitialpartitioningofthedata,usinganefficientgraphpartitioningalgorithm,withanovelhierarchicalclusteringschemethatusesthenotionsofclosenessandinterconnectivity,togetherwiththelocalmodelingofclusters.Thekeyideaisthattwoclustersshouldbemergedonlyiftheresultingclusterissimilartothetwooriginalclusters.Self-similarityisdescribedfirst,andthentheremainingdetailsoftheChameleonalgorithmarepresented.

DecidingWhichClusterstoMergeTheagglomerativehierarchicalclusteringtechniquesconsideredinSection7.3 repeatedlycombinethetwoclosestclustersandareprincipallydistinguishedfromoneanotherbythewaytheydefineclusterproximity.Incontrast,Chameleonaimstomergethepairofclustersthatresultsinaclusterthatismostsimilartotheoriginalpairofclusters,asmeasuredbycloseness

andinterconnectivity.Becausethisapproachdependsonlyonthepairofclustersandnotonaglobalmodel,Chameleoncanhandledatathatcontainsclusterswithwidelydifferentcharacteristics.

Followingaremoredetailedexplanationsofthepropertiesofclosenessandinterconnectivity.Tounderstandtheseproperties,itisnecessarytotakeaproximitygraphviewpointandtoconsiderthenumberofthelinksandthestrengthofthoselinksamongpointswithinaclusterandacrossclusters.

Relative Closeness (RC) is the absolute closeness of two clusters normalized by the internal closeness of the clusters. Two clusters are combined only if the points in the resulting cluster are almost as close to each other as in each of the original clusters. Mathematically,

RC(C_i, C_j) = \frac{\bar{S}_{EC}(C_i, C_j)}{\frac{m_i}{m_i + m_j}\bar{S}_{EC}(C_i) + \frac{m_j}{m_i + m_j}\bar{S}_{EC}(C_j)},   (8.17)

where m_i and m_j are the sizes of clusters C_i and C_j, respectively; \bar{S}_{EC}(C_i, C_j) is the average weight of the edges (of the k-nearest neighbor graph) that connect clusters C_i and C_j; \bar{S}_{EC}(C_i) is the average weight of edges if we bisect cluster C_i; and \bar{S}_{EC}(C_j) is the average weight of edges if we bisect cluster C_j. (EC stands for edge cut.) Figure 8.18 illustrates the notion of relative closeness. As discussed previously, while clusters (a) and (b) are closer in absolute terms than clusters (c) and (d), this is not true if the nature of the clusters is taken into account.

Relative Interconnectivity (RI) is the absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters. Two clusters are combined if the points in the resulting cluster are almost as strongly connected as points in each of the original clusters. Mathematically,

RI(C_i, C_j) = \frac{EC(C_i, C_j)}{\frac{1}{2}\left(EC(C_i) + EC(C_j)\right)},   (8.18)

where EC(C_i, C_j) is the sum of the edges (of the k-nearest neighbor graph) that connect clusters C_i and C_j; EC(C_i) is the minimum sum of the cut edges if we bisect cluster C_i; and EC(C_j) is the minimum sum of the cut edges if we bisect cluster C_j. Figure 8.19 illustrates the notion of relative interconnectivity. The two circular clusters, (c) and (d), have more connections than the rectangular clusters, (a) and (b). However, merging (c) and (d) produces a cluster that has connectivity quite different from that of (c) and (d). In contrast, merging (a) and (b) produces a cluster with connectivity very similar to that of (a) and (b).

Figure 8.19. Illustration of the notion of relative interconnectedness. ©1999, IEEE

RI and RC can be combined in many different ways to yield an overall measure of self-similarity. One approach used in Chameleon is to merge the pair of clusters that maximizes RI(C_i, C_j) * RC(C_i, C_j)^α, where α is a user-specified parameter that is typically greater than 1.
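As a rough illustration of how RI and RC might be computed on a k-nearest neighbor graph, the sketch below uses NetworkX and its Kernighan-Lin heuristic as a stand-in for the minimum-weight bisection that defines EC(C_i) and the average edge-cut weights; the helper names, the toy graph, and the value of α are assumptions for illustration only and are not Chameleon's implementation.

import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

def cut_edges(G, A, B):
    """Edges of G with one endpoint in A and the other in B."""
    return [(u, v, d["weight"]) for u, v, d in G.edges(data=True)
            if (u in A and v in B) or (u in B and v in A)]

def ec(G, A, B):
    """EC(A, B): total weight of the edges connecting the two node sets."""
    return sum(w for *_, w in cut_edges(G, A, B))

def internal_ec(G, C):
    """Approximate EC(C): weight cut by a heuristic bisection of C."""
    half1, half2 = kernighan_lin_bisection(G.subgraph(C), weight="weight")
    return ec(G, half1, half2)

def relative_interconnectivity(G, Ci, Cj):
    return ec(G, Ci, Cj) / (0.5 * (internal_ec(G, Ci) + internal_ec(G, Cj)))

def relative_closeness(G, Ci, Cj):
    between = cut_edges(G, Ci, Cj)
    s_between = sum(w for *_, w in between) / max(len(between), 1)
    def s_internal(C):
        half1, half2 = kernighan_lin_bisection(G.subgraph(C), weight="weight")
        cut = cut_edges(G, half1, half2)
        return sum(w for *_, w in cut) / max(len(cut), 1)
    mi, mj = len(Ci), len(Cj)
    denom = mi / (mi + mj) * s_internal(Ci) + mj / (mi + mj) * s_internal(Cj)
    return s_between / denom

# Toy k-nearest neighbor graph: two tight 4-node groups joined by one weak link.
G = nx.Graph()
for a, b in [(0, 1), (0, 2), (1, 2), (2, 3), (4, 5), (4, 6), (5, 6), (6, 7)]:
    G.add_edge(a, b, weight=1.0)
G.add_edge(3, 4, weight=0.2)
Ci, Cj = {0, 1, 2, 3}, {4, 5, 6, 7}
alpha = 2.0
print(relative_interconnectivity(G, Ci, Cj) * relative_closeness(G, Ci, Cj) ** alpha)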

Chameleon Algorithm

Chameleonconsistsofthreekeysteps:sparsification,graphpartitioning,andhierarchicalclustering.Algorithm8.9 andFigure8.20 describethesesteps.

Figure8.20.OverallprocessbywhichChameleonperformsclustering.©1999,IEEE

Algorithm 8.9 Chameleon algorithm.
1: Build a k-nearest neighbor graph.
2: Partition the graph using a multilevel graph partitioning algorithm.
3: repeat
4: Merge the clusters that best preserve the cluster self-similarity with respect to relative interconnectivity and relative closeness.
5: until No more clusters can be merged.

Sparsification

The first step in Chameleon is to generate a k-nearest neighbor graph. Conceptually, such a graph is derived from the proximity graph, and it contains links only between a point and its k-nearest neighbors, i.e., the points to which it is closest. As mentioned, working with a sparsified proximity graph instead of the full proximity graph can significantly reduce the effects of noise and outliers and improve computational efficiency.

Graph Partitioning Once a sparsified graph has been obtained, an efficient multilevel graph partitioning algorithm, such as METIS (see Bibliographic Notes), can be used to partition the data set. Chameleon starts with an all-inclusive graph (cluster) and then bisects the largest current subgraph (cluster) until no cluster has more than a specified number of points, where this maximum cluster size is a user-specified parameter. This process results in a large number of roughly equally sized groups of well-connected vertices (highly similar data points). The goal is to ensure that each partition contains objects mostly from one true cluster.

AgglomerativeHierarchicalClustering

Asdiscussedpreviously,Chameleonmergesclustersbasedonthenotionofself-similarity.Chameleoncanbeparameterizedtomergemorethanonepairofclustersinasinglestepandtostopbeforeallobjectshavebeenmergedintoasinglecluster.

Complexity

Assume that m is the number of data points and p is the number of partitions. Performing an agglomerative hierarchical clustering of the p partitions obtained from the graph partitioning requires time O(p^2 log p). (See Section 7.3.1.) The amount of time required for partitioning the graph is O(mp + m log m).

The time complexity of graph sparsification depends on how much time it takes to build the k-nearest neighbor graph. For low-dimensional data, this takes O(m log m) time if a k-d tree or a similar type of data structure is used. Unfortunately, such data structures only work well for low-dimensional data sets, and thus, for high-dimensional data sets, the time complexity of the sparsification becomes O(m^2). Because only the k-nearest neighbor list needs to be stored, the space complexity is O(km) plus the space required to store the data.

Example 8.11. Chameleon was applied to two data sets that clustering algorithms such as K-means and DBSCAN have difficulty clustering. The results of this clustering are shown in Figure 8.21. The clusters are identified by the shading of the points. In Figure 8.21(a), the two clusters are irregularly shaped and quite close to each other. Also, noise is present. In Figure 8.21(b), the two clusters are connected by a bridge, and again, noise is present. Nonetheless, Chameleon identifies what most people would identify as the natural clusters. Chameleon has specifically been shown to be very effective for clustering spatial data. Finally, notice that Chameleon does not discard noise points, as do other clustering schemes, but instead assigns them to the clusters.

Figure 8.21. Chameleon applied to cluster a pair of two-dimensional sets of points. ©1999, IEEE

StrengthsandLimitationsChameleoncaneffectivelyclusterspatialdata,eventhoughnoiseandoutliersarepresentandtheclustersareofdifferentshapes,sizes,anddensity.Chameleonassumesthatthegroupsofobjectsproducedbythesparsificationandgraphpartitioningprocessaresubclusters;i.e.,thatmostofthepointsinapartitionbelongtothesametruecluster.Ifnot,thenagglomerativehierarchicalclusteringwillonlycompoundtheerrorsbecauseitcanneverseparateobjectsthathavebeenwronglyputtogether.(SeethediscussioninSection7.3.4 .)Thus,Chameleonhasproblemswhenthepartitioningprocessdoesnotproducesubclusters,asisoftenthecaseforhigh-dimensionaldata.

8.4.5SpectralClustering

Spectral clustering is an elegant graph partitioning approach that exploits properties of the similarity graph to determine the cluster partitions. Specifically, it examines the graph's spectrum, i.e., eigenvalues and eigenvectors associated with the adjacency matrix of the graph, to identify the natural clusters of the data. To motivate the ideas behind this approach, consider the similarity graph shown in Figure 8.22 for a data set that contains 6 data points. The link weights in the graph are computed based on some similarity measure, with a threshold applied to remove links with low similarity values. The sparsification produces a graph with two connected components, {v_1, v_2, v_3} and {v_4, v_5, v_6}, which trivially represent the two clusters in the data.

Figure8.22.Exampleofasimilaritygraphwithtwoconnectedcomponentsalongwithitsweightedadjacencymatrix(W),graphLaplacianmatrix(L),andeigendecomposition.

The top right-hand panel of the figure also shows the weighted adjacency matrix of the graph, denoted as W, and a diagonal matrix, D, whose diagonal elements correspond to the sum of the weights of the links incident to each node in the graph, i.e.,

D_{ij} = \begin{cases} \sum_k W_{ik}, & \text{if } i = j; \\ 0, & \text{otherwise.} \end{cases}

Note that the rows and columns of the weighted adjacency matrix have been ordered in such a way that nodes belonging to the same connected component are next to each other. With this ordering, the matrix W has a block structure of the form

W = \begin{pmatrix} W_1 & 0 \\ 0 & W_2 \end{pmatrix},

in which the off-diagonal blocks are matrices of zero values since there are no links connecting a node from the first connected component to a node from the second connected component. Indeed, if the sparse graph contains k connected components, its weighted adjacency matrix can be re-ordered into the following block diagonal form:

W = \begin{pmatrix} W_1 & 0 & \cdots & 0 \\ 0 & W_2 & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots \\ 0 & 0 & \cdots & W_k \end{pmatrix}.   (8.19)

This example suggests the possibility of identifying the inherent clusters of a data set by examining the block structure of its weighted adjacency matrix.

Unfortunately, unless the clusters are well-separated, the adjacency matrices associated with most similarity graphs are not in block diagonal form. For example, consider the graph shown in Figure 8.23, in which there is a link between nodes v_3 and v_4 with a low similarity value. If we are interested in generating two clusters, we could break the weakest link, located between (v_3, v_4), to split the graph into two partitions. Because there is only one connected component in the graph, the block structure in W is harder to discern.

Figure8.23.Exampleofasimilaritygraphwithasingleconnectedcomponentalongwithitsweightedadjacencymatrix(W),graphLaplacianmatrix(L),andeigendecomposition.

Fortunately, there is a more objective way to create the cluster partitions by considering the graph spectrum. First, we need to compute the graph Laplacian matrix, which is formally defined as follows:

L = D - W.   (8.20)

The graph Laplacian matrices for the examples shown in Figures 8.22 and 8.23 are depicted in the bottom left panel of both diagrams. The matrix has several notable properties:

1. It is a symmetric matrix since both W and D are symmetric.

2. It is a positive semi-definite matrix, which means v^T L v ≥ 0 for any input vector v.

3. All eigenvalues of L must be non-negative. The eigenvalues and eigenvectors for the graphs shown in Figures 8.22 and 8.23 are denoted in the diagrams as Λ and V, respectively. Note that the eigenvalues of the graph Laplacian matrix are given by the diagonal elements of Λ.

4. The smallest eigenvalue of L is zero, with the corresponding eigenvector e, which is a vector of 1s. This is because

We = \begin{pmatrix} W_{11} & W_{12} & \cdots & W_{1n} \\ W_{21} & W_{22} & \cdots & W_{2n} \\ \cdots & \cdots & \cdots & \cdots \\ W_{n1} & W_{n2} & \cdots & W_{nn} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ \cdots \\ 1 \end{pmatrix} = \begin{pmatrix} \sum_j W_{1j} \\ \sum_j W_{2j} \\ \cdots \\ \sum_j W_{nj} \end{pmatrix}, \qquad
De = \begin{pmatrix} \sum_j W_{1j} & 0 & \cdots & 0 \\ 0 & \sum_j W_{2j} & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots \\ 0 & 0 & \cdots & \sum_j W_{nj} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ \cdots \\ 1 \end{pmatrix} = \begin{pmatrix} \sum_j W_{1j} \\ \sum_j W_{2j} \\ \cdots \\ \sum_j W_{nj} \end{pmatrix}.

Thus, We = De, which is equivalent to (D - W)e = 0. This can be simplified into the eigenvalue equation Le = 0e since L = D - W.

5. A graph with k connected components has an adjacency matrix W in block diagonal form as shown in Equation 8.19. Its graph Laplacian matrix also has a block diagonal form

L = \begin{pmatrix} L_1 & 0 & \cdots & 0 \\ 0 & L_2 & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots \\ 0 & 0 & \cdots & L_k \end{pmatrix}.

In addition, its graph Laplacian matrix has k eigenvalues of zeros, with the corresponding eigenvectors

\begin{pmatrix} e_1 \\ 0 \\ \cdots \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ e_2 \\ \cdots \\ 0 \end{pmatrix}, \cdots, \begin{pmatrix} 0 \\ 0 \\ \cdots \\ e_k \end{pmatrix},

where the e_i's are vectors of 1's and the 0's are vectors of 0's. For example, the graph shown in Figure 8.22 contains two connected components, which is why its graph Laplacian matrix has two eigenvalues of zeros. More importantly, its first two eigenvectors (normalized to unit length),

\begin{array}{c|cc} v_1 & 0.58 & 0 \\ v_2 & 0.58 & 0 \\ v_3 & 0.58 & 0 \\ v_4 & 0 & -0.58 \\ v_5 & 0 & -0.58 \\ v_6 & 0 & -0.58 \end{array},

corresponding to the first two columns in V, provide information about the cluster membership of each node. A node that belongs to the first cluster has a positive value in its first eigenvector and a zero value in its second eigenvector, whereas a node that belongs to the second cluster has a zero value in the first eigenvector and a negative value in the second eigenvector.

The graph shown in Figure 8.23 has one eigenvalue of zero because it has only one connected component. Nevertheless, if we examine the first two eigenvectors of its graph Laplacian matrix,

\begin{array}{c|cc} v_1 & 0.41 & -0.41 \\ v_2 & 0.41 & -0.43 \\ v_3 & 0.41 & -0.38 \\ v_4 & 0.41 & 0.38 \\ v_5 & 0.41 & 0.42 \\ v_6 & 0.41 & 0.42 \end{array},

the graph can be easily split into two clusters since the set of nodes {v_1, v_2, v_3} has a negative value in the second eigenvector whereas {v_4, v_5, v_6} has a positive value in the second eigenvector. In short, the eigenvectors of the graph Laplacian matrix contain information that can be used to partition the graph into its underlying components. However, instead of manually checking the eigenvectors, it is common practice to apply a simple clustering algorithm such as K-means to help extract the clusters from the eigenvectors. A summary of the spectral clustering algorithm is given in Algorithm 8.10.

Algorithm 8.10 Spectral clustering algorithm.
1: Create a sparsified similarity graph G.
2: Compute the graph Laplacian for G, L (see Equation (8.20)).
3: Create a matrix V from the first k eigenvectors of L.
4: Apply K-means clustering on V to obtain the k clusters.

Example 8.12. Consider the two-dimensional ring data shown in Figure 8.24(b), which contains 350 data points. The first 100 points belong to the inner ring while the remaining 250 points belong to the outer ring. A heat map showing the Euclidean distance between every pair of points is depicted in Figure 8.24(a). While the points in the inner ring are relatively close to each other, those located in the outer ring can be quite far from each other. As a result, standard clustering algorithms such as K-means perform poorly on the data. In contrast, applying spectral clustering on the sparsified similarity graph can produce the correct clustering results (see Figure 8.24(d)). Here, the similarity between points is calculated using the Gaussian radial basis function and the graph is sparsified by choosing the 10-nearest neighbors for each data point. The sparsification reduces the similarity between a data point located in the inner ring and a corresponding point in the outer ring, which enables spectral clustering to effectively partition the data set into two clusters.

Figure8.24.ApplicationofK-meansandspectralclusteringtoatwo-dimensionalringdata.
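Algorithm 8.10 can be sketched in a few lines with NumPy and scikit-learn. The Gaussian RBF weights, the 10-nearest-neighbor sparsification, and the ring data below echo Example 8.12, but the parameter values and the data generation are illustrative choices, not the book's exact setup.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

def spectral_clustering(X, n_clusters=2, n_neighbors=10, sigma=1.0):
    """Sketch of Algorithm 8.10: sparsified similarity graph -> Laplacian ->
    first k eigenvectors -> K-means on the eigenvector rows."""
    # 1: sparsified similarity graph (Gaussian RBF weights on a k-NN graph).
    dist = kneighbors_graph(X, n_neighbors, mode="distance").toarray()
    W = np.where(dist > 0, np.exp(-dist ** 2 / (2 * sigma ** 2)), 0.0)
    W = np.maximum(W, W.T)                        # make the graph undirected
    # 2: graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W
    # 3: matrix V from the first k eigenvectors (smallest eigenvalues).
    eigvals, eigvecs = np.linalg.eigh(L)
    V = eigvecs[:, :n_clusters]
    # 4: K-means on the rows of V.
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(V)

# Ring data: an inner and an outer ring, which defeat plain K-means.
rng = np.random.default_rng(4)
theta = rng.uniform(0, 2 * np.pi, 300)
radius = np.where(np.arange(300) < 100, 1.0, 4.0) + rng.normal(0, 0.1, 300)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
labels = spectral_clustering(X, n_clusters=2, n_neighbors=10, sigma=1.0)
print(np.bincount(labels[:100]), np.bincount(labels[100:]))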

Relationship between Spectral Clustering and Graph Partitioning The objective of graph partitioning is to break the weak links in a graph until the desired number of cluster partitions is obtained. One way to assess the quality of the partitions is by summing up the weights of the links that were removed. The resulting measure is known as graph cut. Unfortunately, minimizing the graph cut of the partitions alone is insufficient as it tends to produce clusters with highly imbalanced sizes. For example, consider the graph shown in Figure 8.25. Suppose we are interested in partitioning the graph into two connected components. The graph cut measure prefers to break the link between v_4 and v_5 because it has the lowest weight.

Figure 8.25. Example to illustrate the limitation of using graph cut as evaluation measure for graph partitioning.

Unfortunately, such a split would create one cluster with a single isolated node and another cluster containing all the remaining nodes. To overcome this limitation, alternative measures have been proposed, including

\mathrm{Ratiocut}(C_1, C_2, \cdots, C_k) = \frac{1}{2} \sum_{i=1}^{k} \frac{\sum_{p \in C_i, q \notin C_i} W_{pq}}{|C_i|},

where C_1, C_2, \cdots, C_k denote the cluster partitions. The numerator represents the sum of the weights of the broken links, i.e., the graph cut, while the denominator represents the size of each cluster partition. Such a measure can be used to ensure that the resulting clusters are more balanced in terms of their sizes. More importantly, it can be shown that minimizing the ratio cut for a graph is equivalent to finding a cluster membership matrix Y that minimizes the expression Tr[Y^T L Y], where Tr[·] denotes the trace of a matrix and L is the graph Laplacian, subject to the constraint Y^T Y = I. By relaxing the requirement that Y is a binary matrix, we can use the Lagrange multiplier method to solve the optimization problem. This leads to the Lagrangian

\mathcal{L} = \mathrm{Tr}[Y^T L Y] - \lambda\left(\mathrm{Tr}[Y^T Y - I]\right), \qquad \frac{\partial \mathcal{L}}{\partial Y} = LY - \lambda Y = 0 \;\Rightarrow\; LY = \lambda Y.

In other words, an approximate solution to the ratio cut minimization problem can be obtained by finding the eigenvectors of the graph Laplacian matrix, which is exactly the approach used by spectral clustering.

Strengths and Limitations As shown in Example 8.12, the strength of spectral clustering lies in its ability to detect clusters of varying sizes and shapes. However, the clustering performance depends on how the similarity graph is created and sparsified. In particular, tuning the parameters of the similarity function (e.g., Gaussian radial basis function) to produce an appropriate sparse graph for spectral clustering can be quite a challenge. The time complexity of the algorithm depends on how fast the eigenvectors of the graph Laplacian matrix can be computed. Efficient eigensolvers for sparse matrices are available, e.g., those based on Krylov subspace methods, especially when the number of clusters chosen is small. The storage complexity is O(N^2), though it can be significantly reduced using a sparse representation for the graph Laplacian matrix. In many ways, spectral clustering behaves similarly to the K-means clustering algorithm. First, they both require the user to specify the number of clusters as an input parameter. Both methods are also susceptible to the presence of outliers, which tend to form their own connected components (clusters). Thus, preprocessing or postprocessing methods will be needed to handle outliers in the data.

8.4.6SharedNearestNeighborSimilarity

Insomecases,clusteringtechniquesthatrelyonstandardapproachestosimilarityanddensitydonotproducethedesiredclusteringresults.Thissectionexaminesthereasonsforthisandintroducesanindirectapproachtosimilaritythatisbasedonthefollowingprinciple:

If two points are similar to many of the same points, then they are similar to one another, even if a direct measurement of similarity does not indicate this.

WemotivatethediscussionbyfirstexplainingtwoproblemsthatanSNNversionofsimilarityaddresses:lowsimilarityanddifferencesindensity.

ProblemswithTraditionalSimilarityinHigh-DimensionalDataInhigh-dimensionalspaces,itisnotunusualforsimilaritytobelow.Consider,forexample,asetofdocumentssuchasacollectionofnewspaperarticlesthatcomefromavarietyofsectionsofthenewspaper:Entertainment,Financial,Foreign,Metro,National,andSports.AsexplainedinChapter2 ,thesedocumentscanbeviewedasvectorsinahigh-dimensionalspace,

whereeachcomponentofthevector(attribute)recordsthenumberoftimesthateachwordinavocabularyoccursinadocument.Thecosinesimilaritymeasureisoftenusedtoassessthesimilaritybetweendocuments.Forthisexample,whichcomesfromacollectionofarticlesfromtheLosAngelesTimes,Table8.3 givestheaveragecosinesimilarityineachsectionandamongtheentiresetofdocuments.

Table8.3.Similarityamongdocumentsindifferentsectionsofanewspaper.

Section    Average Cosine Similarity

Entertainment 0.032

Financial 0.030

Foreign 0.030

Metro 0.021

National 0.027

Sports 0.036

AllSections 0.014

Thesimilarityofeachdocumenttoitsmostsimilardocument(thefirstnearestneighbor)isbetter,0.39onaverage.However,aconsequenceoflowsimilarityamongobjectsofthesameclassisthattheirnearestneighborisoftennotofthesameclass.InthecollectionofdocumentsfromwhichTable8.3 wasgenerated,about20%ofthedocumentshaveanearestneighborofadifferentclass.Ingeneral,ifdirectsimilarityislow,thenitbecomesanunreliableguideforclusteringobjects,especiallyforagglomerativehierarchicalclustering,wheretheclosestpointsareputtogetherandcannot

beseparatedafterward.Nonetheless,itisstillusuallythecasethatalargemajorityofthenearestneighborsofanobjectbelongtothesameclass;thisfactcanbeusedtodefineaproximitymeasurethatismoresuitableforclustering.

ProblemswithDifferencesinDensityAnotherproblemrelatestodifferencesindensitiesbetweenclusters.Figure8.26 showsapairoftwo-dimensionalclustersofpointswithdifferingdensity.Thelowerdensityoftherightmostclusterisreflectedinaloweraveragedistanceamongthepoints.Eventhoughthepointsinthelessdenseclusterformanequallyvalidcluster,typicalclusteringtechniqueswillhavemoredifficultyfindingsuchclusters.Also,normalmeasuresofcohesion,suchasSSE,willindicatethattheseclustersarelesscohesive.Toillustratewitharealexample,thestarsinagalaxyarenolessrealclustersofstellarobjectsthantheplanetsinasolarsystem,eventhoughtheplanetsinasolarsystemareconsiderablyclosertooneanotheronaverage,thanthestarsinagalaxy.

Figure8.26.Twocircularclustersof200uniformlydistributedpoints.

SNNSimilarityComputationInbothsituations,thekeyideaistotakethecontextofpointsintoaccountindefiningthesimilaritymeasure.ThisideacanbemadequantitativebyusingasharednearestneighbordefinitionofsimilarityinthemannerindicatedbyAlgorithm8.11 .Essentially,theSNNsimilarityisthenumberofsharedneighborsaslongasthetwoobjectsareoneachother’snearestneighborlists.Notethattheunderlyingproximitymeasurecanbeanymeaningfulsimilarityordissimilaritymeasure.

Algorithm 8.11 Computing shared nearest neighbor similarity.
1: Find the k-nearest neighbors of all points.
2: if two points, x and y, are not among the k-nearest neighbors of each other then
3: similarity(x, y) ← 0
4: else
5: similarity(x, y) ← number of shared neighbors
6: end if

The computation of SNN similarity is described by Algorithm 8.11 and graphically illustrated by Figure 8.27. Each of the two black points has eight nearest neighbors, including each other. Four of those nearest neighbors, the points in gray, are shared. Thus, the shared nearest neighbor similarity between the two points is 4.

Figure8.27.ComputationofSNNsimilaritybetweentwopoints.
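A direct transcription of Algorithm 8.11 in NumPy might look like the following; the brute-force distance matrix is only for illustration (for low-dimensional data a k-d tree or similar structure would normally supply the nearest neighbor lists).

import numpy as np

def snn_similarity(X, k=8):
    """Algorithm 8.11 sketch: shared-nearest-neighbor similarity.
    similarity(x, y) = number of shared k-nearest neighbors, but only if
    x and y are on each other's k-nearest neighbor lists; otherwise 0."""
    m = X.shape[0]
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    knn = [set(np.argsort(dist[i])[:k]) for i in range(m)]
    snn = np.zeros((m, m), dtype=int)
    for i in range(m):
        for j in range(i + 1, m):
            if i in knn[j] and j in knn[i]:
                snn[i, j] = snn[j, i] = len(knn[i] & knn[j])
    return snn

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])
S = snn_similarity(X, k=8)
print(S[0, 1], S[0, 25])   # high within a cluster, zero across clusters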

ThesimilaritygraphoftheSNNsimilaritiesamongobjectsiscalledtheSNNsimilaritygraph.BecausemanypairsofobjectswillhaveanSNNsimilarityof0,thisisaverysparsegraph.

SNNSimilarityversusDirectSimilaritySNNsimilarityisusefulbecauseitaddressessomeoftheproblemsthatoccurwithdirectsimilarity.First,sinceittakesintoaccountthecontextofanobjectbyusingthenumberofsharednearestneighbors,SNNsimilarityhandlesthesituationinwhichanobjecthappenstoberelativelyclosetoanotherobject,butbelongstoadifferentclass.Insuchcases,theobjectstypicallydonotsharemanynearneighborsandtheirSNNsimilarityislow.

SNNsimilarityalsoaddressesproblemswithclustersofvaryingdensity.Inalow-densityregion,theobjectsarefartherapartthanobjectsindenserregions.However,theSNNsimilarityofapairofpointsonlydependsonthenumberofnearestneighborstwoobjectsshare,nothowfartheseneighborsarefromeachobject.Thus,SNNsimilarityperformsanautomaticscalingwithrespecttothedensityofthepoints.

8.4.7 The Jarvis-Patrick Clustering Algorithm

Algorithm8.12 expressestheJarvis-Patrickclusteringalgorithmusingtheconceptsofthelastsection.TheJPclusteringalgorithmreplacestheproximitybetweentwopointswiththeSNNsimilarity,whichiscalculatedasdescribedinAlgorithm8.11 .AthresholdisthenusedtosparsifythismatrixofSNNsimilarities.Ingraphterms,anSNNsimilaritygraphiscreatedandsparsified.ClustersaresimplytheconnectedcomponentsoftheSNNgraph.

Algorithm 8.12 Jarvis-Patrick clustering algorithm.
1: Compute the SNN similarity graph.
2: Sparsify the SNN similarity graph by applying a similarity threshold.
3: Find the connected components (clusters) of the sparsified SNN similarity graph.

The storage requirements of the JP clustering algorithm are only O(km), because it is not necessary to store the entire similarity matrix, even initially. The basic time complexity of JP clustering is O(m^2), since the creation of the k-nearest neighbor list can require the computation of O(m^2) proximities. However, for certain types of data, such as low-dimensional Euclidean data, special techniques, e.g., a k-d tree, can be used to more efficiently find the k-nearest neighbors without computing the entire similarity matrix. This can reduce the time complexity from O(m^2) to O(m log m).

Example 8.13 (JP Clustering of a Two-Dimensional Data Set). We applied JP clustering to the "fish" data set shown in Figure 8.28(a) to find the clusters shown in Figure 8.28(b). The size of the nearest neighbor list was 20, and two points were placed in the same cluster if they shared at least 10 points. The different clusters are shown by the different markers and different shading. The points whose marker is an "x" were classified as noise by Jarvis-Patrick. They are mostly in the transition regions between clusters of different density.

Figure8.28.Jarvis-Patrickclusteringofatwo-dimensionalpointset.
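A compact sketch of Algorithm 8.12, using the same style of parameters as the example above (a nearest neighbor list of size k and a threshold of min_shared shared points), might look like this; the connected components are taken with SciPy, and singleton components correspond to unclustered (noise) points. The data and parameter values are illustrative.

import numpy as np
from scipy.sparse.csgraph import connected_components

def jarvis_patrick(X, k=20, min_shared=10):
    """Algorithm 8.12 sketch: SNN similarity graph, threshold, connected components."""
    m = X.shape[0]
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    knn = [set(np.argsort(dist[i])[:k]) for i in range(m)]
    adj = np.zeros((m, m), dtype=int)
    for i in range(m):
        for j in range(i + 1, m):
            # Keep the link only if i and j are in each other's k-NN lists
            # and share at least `min_shared` neighbors.
            if i in knn[j] and j in knn[i] and len(knn[i] & knn[j]) >= min_shared:
                adj[i, j] = adj[j, i] = 1
    _, labels = connected_components(adj, directed=False)
    return labels

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(8, 1, (60, 2))])
print(np.unique(jarvis_patrick(X, k=20, min_shared=10), return_counts=True))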

Strengths and Limitations Because JP clustering is based on the notion of SNN similarity, it is good at dealing with noise and outliers and can handle clusters of different sizes, shapes, and densities. The algorithm works well for high-dimensional data and is particularly good at finding tight clusters of strongly related objects.

However, JP clustering defines a cluster as a connected component in the SNN similarity graph. Thus, whether a set of objects is split into two clusters or left as one can depend on a single link. Hence, JP clustering is somewhat brittle; i.e., it can split true clusters or join clusters that should be kept separate.

Another potential limitation is that not all objects are clustered. However, these objects can be added to existing clusters, and in some cases, there is no requirement for a complete clustering. JP clustering has a basic time complexity of O(m^2), which is the time required to compute the nearest neighbor list for a set of objects in the general case. In certain cases, e.g., low-dimensional data, special techniques can be used to reduce the time complexity for finding nearest neighbors to O(m log m). Finally, as with other clustering algorithms, choosing the best values for the parameters can be challenging.

8.4.8SNNDensity

As discussed in the introduction to this chapter, traditional Euclidean density becomes meaningless in high dimensions. This is true whether we take a grid-based view, such as that used by CLIQUE, a center-based view, such as that used by DBSCAN, or a kernel-density estimation approach, such as that used by DENCLUE. It is possible to use the center-based definition of density with a similarity measure that works well for high dimensions, e.g., cosine or Jaccard, but as described in Section 8.4.6, such measures still have problems. However, because the SNN similarity measure reflects the local configuration of the points in the data space, it is relatively insensitive to variations in density and the dimensionality of the space, and is a promising candidate for a new measure of density.

ThissectionexplainshowtodefineaconceptofSNNdensitybyusingSNNsimilarityandfollowingtheDBSCANapproachdescribedinSection7.4 .Forclarity,thedefinitionsofthatsectionarerepeated,withappropriatemodificationtoaccountforthefactthatweareusingSNNsimilarity.

Corepoints.Apointisacorepointifthenumberofpointswithinagivenneighborhoodaroundthepoint,asdeterminedbySNNsimilarityandasuppliedparameterEpsexceedsacertainthresholdMinPts,whichisalsoasuppliedparameter.

Borderpoints.Aborderpointisapointthatisnotacorepoint,i.e.,therearenotenoughpointsinitsneighborhoodforittobeacorepoint,butitfallswithintheneighborhoodofacorepoint.

Noisepoints.Anoisepointisanypointthatisneitheracorepointnoraborderpoint.

SNNdensitymeasuresthedegreetowhichapointissurroundedbysimilarpoints(withrespecttonearestneighbors).Thus,pointsinregionsofhighandlowdensitywilltypicallyhaverelativelyhighSNNdensity,whilepointsinregionswherethereisatransitionfromlowtohighdensity—pointsthatarebetweenclusters—willtendtohavelowSNNdensity.Suchanapproachiswell-suitedfordatasetsinwhichtherearewidevariationsindensity,butclustersoflowdensityarestillinteresting.

Example8.14(Core,Border,andNoisePoints).

TomaketheprecedingdiscussionofSNNdensitymoreconcrete,weprovideanexampleofhowSNNdensitycanbeusedtofindcorepointsandremovenoiseandoutliers.Thereare10,000pointsinthe2DpointdatasetshowninFigure8.29(a) .Figures8.29(b–d) distinguishbetweenthesepointsbasedontheirSNNdensity.Figure8.29(b) showsthepointswiththehighestSNNdensity,whileFigure8.29(c) showspointsofintermediateSNNdensity,andFigure8.29(d) showsfiguresofthelowestSNNdensity.Fromthesefigures,weseethatthepointsthathavehighdensity(i.e.,highconnectivityintheSNNgraph)arecandidatesforbeingrepresentativeorcorepointssincetheytendtobelocatedwellinsidethecluster,whilethepointsthathavelowconnectivityarecandidatesforbeingnoisepointsandoutliers,astheyaremostlyintheregionssurroundingtheclusters.

Figure8.29.SNNdensityoftwo-dimensionalpoints.

8.4.9SNNDensity-BasedClustering

TheSNNdensitydefinedabovecanbecombinedwiththeDBSCANalgorithmtocreateanewclusteringalgorithm.ThisalgorithmissimilartotheJPclusteringalgorithminthatitstartswiththeSNNsimilaritygraph.However,

insteadofusingathresholdtosparsifytheSNNsimilaritygraphandthentakingconnectedcomponentsasclusters,theSNNdensity-basedclusteringalgorithmsimplyappliesDBSCAN.

TheSNNDensity-basedClusteringAlgorithmThestepsoftheSNNdensity-basedclusteringalgorithmareshowninAlgorithm8.13 .

Algorithm 8.13 SNN density-based clustering algorithm.
1: Compute the SNN similarity graph.
2: Apply DBSCAN with user-specified parameters for Eps and MinPts.

The algorithm automatically determines the number of clusters in the data. Note that not all the points are clustered. The points that are discarded include noise and outliers, as well as points that are not strongly connected to a group of points. SNN density-based clustering finds clusters in which the points are strongly related to one another. Depending on the application, we might want to discard many of the points. For example, SNN density-based clustering is good for finding topics in groups of documents.
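One way to sketch Algorithm 8.13 is to hand a precomputed SNN dissimilarity (here simply k minus the number of shared neighbors) to an off-the-shelf DBSCAN; this substitution and the parameter values are assumptions for illustration, not the authors' implementation, with eps and min_samples playing the roles of Eps and MinPts.

import numpy as np
from sklearn.cluster import DBSCAN

def snn_dbscan(X, k=20, eps=12, min_pts=5):
    """Algorithm 8.13 sketch: build the SNN similarity graph, then run DBSCAN
    on the corresponding dissimilarity (k minus the number of shared neighbors)."""
    m = X.shape[0]
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    knn = [set(np.argsort(dist[i])[:k]) for i in range(m)]
    snn = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            if i in knn[j] and j in knn[i]:
                snn[i, j] = snn[j, i] = len(knn[i] & knn[j])
    snn_dist = k - snn                        # high shared count -> small "distance"
    np.fill_diagonal(snn_dist, 0)
    # eps plays the role of Eps and min_pts the role of MinPts in the text.
    return DBSCAN(eps=eps, min_samples=min_pts, metric="precomputed").fit_predict(snn_dist)

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.5, (80, 2)), rng.normal(5, 2.0, (80, 2))])  # differing densities
print(np.unique(snn_dbscan(X), return_counts=True))   # label -1 marks discarded points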

Example 8.15 (SNN Density-based Clustering of Time Series). The SNN density-based clustering algorithm presented in this section is more flexible than Jarvis-Patrick clustering or DBSCAN. Unlike DBSCAN, it can be used for high-dimensional data and situations in which the clusters have different densities. Unlike Jarvis-Patrick, which performs a simple thresholding and then takes the connected components as clusters, SNN density-based clustering uses a less brittle approach that relies on the concepts of SNN density and core points.

To demonstrate the capabilities of SNN density-based clustering on high-dimensional data, we applied it to monthly time series data of atmospheric pressure at various points on the Earth. More specifically, the data consists of the average monthly sea-level pressure (SLP) for a period of 41 years at each point on a 2.5° longitude-latitude grid. The SNN density-based clustering algorithm found the clusters (gray regions) indicated in Figure 8.30. Note that these are clusters of time series of length 492 months, even though they are visualized as two-dimensional regions. The white areas are regions in which the pressure was not as uniform. The clusters near the poles are elongated because of the distortion of mapping a spherical surface to a rectangle.

Figure8.30.ClustersofpressuretimeseriesfoundusingSNNdensity-basedclustering.

UsingSLP,Earthscientistshavedefinedtimeseries,calledclimateindices,whichareusefulforcapturingthebehaviorofphenomenainvolvingtheEarth’sclimate.Forexample,anomaliesinclimateindicesarerelatedtoabnormallyloworhighprecipitationortemperatureinvariouspartsoftheworld.SomeoftheclustersfoundbySNNdensity-basedclusteringhaveastrongconnectiontosomeoftheclimateindicesknowntoEarthscientists.

Figure8.31 showstheSNNdensitystructureofthedatafromwhichtheclusterswereextracted.Thedensityhasbeennormalizedtobeonascale

between0and1.Thedensityofatimeseriesmayseemlikeanunusualconcept,butitmeasuresthedegreetowhichthetimeseriesanditsnearestneighborshavethesamenearestneighbors.Becauseeachtimeseriesisassociatedwithalocation,itispossibletoplotthesedensitiesonatwo-dimensionalplot.Becauseoftemporalautocorrelation,thesedensitiesformmeaningfulpatterns,e.g.,itispossibletovisuallyidentifytheclustersofFigure8.31 .

Figure8.31.SNNdensityofpressuretimeseries.

StrengthsandLimitations

ThestrengthsandlimitationsofSNNdensity-basedclusteringaresimilartothoseofJPclustering.However,theuseofcorepointsandSNNdensityaddsconsiderablepowerandflexibilitytothisapproach.

8.5ScalableClusteringAlgorithmsEventhebestclusteringalgorithmisoflittlevalueifittakesanunacceptablylongtimetoexecuteorrequirestoomuchmemory.Thissectionexaminesclusteringtechniquesthatplacesignificantemphasisonscalabilitytotheverylargedatasetsthatarebecomingincreasinglycommon.Westartbydiscussingsomegeneralstrategiesforscalability,includingapproachesforreducingthenumberofproximitycalculations,samplingthedata,partitioningthedata,andclusteringasummarizedrepresentationofthedata.Wethendiscusstwospecificexamplesofscalableclusteringalgorithms:CUREandBIRCH.

8.5.1Scalability:GeneralIssuesandApproaches

The amount of storage required for many clustering algorithms is more than linear; e.g., with hierarchical clustering, memory requirements are usually O(m^2), where m is the number of objects. For 10,000,000 objects, for example, the amount of memory required is proportional to 10^14, a number still well beyond the capacities of current systems. Note that because of the requirement for random data access, many clustering algorithms cannot easily be modified to efficiently use secondary storage (disk), for which random data access is slow. Likewise, the amount of computation required for some clustering algorithms is more than linear. In the remainder of this section, we discuss a variety of techniques for reducing the amount of computation and storage required by a clustering algorithm. CURE and BIRCH use some of these techniques.

MultidimensionalorSpatialAccessMethodsManytechniques,suchasK-means,JarvisPatrickclustering,andDBSCAN,needtofindtheclosestcentroid,thenearestneighborsofapoint,orallpointswithinaspecifieddistance.Itispossibletousespecialtechniquescalledmultidimensionalorspatialaccessmethodstomoreefficientlyperformthesetasks,atleastforlow-dimensionaldata.Thesetechniques,suchasthek-dtreeorR*-tree,typicallyproduceahierarchicalpartitionofthedataspacethatcanbeusedtoreducethetimerequiredtofindthenearestneighborsofapoint.Notethatgrid-basedclusteringschemesalsopartitionthedataspace.

BoundsonProximitiesAnotherapproachtoavoidingproximitycomputationsistouseboundsonproximities.Forinstance,whenusingEuclideandistance,itispossibletousethetriangleinequalitytoavoidmanydistancecalculations.Toillustrate,ateachstageoftraditionalK-means,itisnecessarytoevaluatewhetherapointshouldstayinitscurrentclusterorbemovedtoanewcluster.Ifweknowthedistancebetweenthecentroidsandthedistanceofapointtothe(newlyupdated)centroidoftheclustertowhichitcurrentlybelongs,thenwemightbeabletousethetriangleinequalitytoavoidcomputingthedistanceofthepointtoanyoftheothercentroids.SeeExercise21 onpage702.
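The following sketch shows the pruning test just described for a single point: a candidate centroid can be skipped whenever the point's distance to its current centroid is at most half the distance between the two centroids, because the triangle inequality then guarantees the candidate cannot be closer. The function name and example values are illustrative, and in practice the centroid-to-centroid distances would be computed once per iteration rather than inside the loop.

import numpy as np

def assign_with_pruning(point, centroids, current):
    """Assign `point` to the nearest centroid, skipping candidates ruled out by
    the triangle inequality: if d(x, c_cur) <= d(c_cur, c_j) / 2, then
    d(x, c_j) >= d(c_cur, c_j) - d(x, c_cur) >= d(x, c_cur), so c_j cannot win."""
    d_cur = np.linalg.norm(point - centroids[current])
    best, best_dist, skipped = current, d_cur, 0
    for j, c in enumerate(centroids):
        if j == current:
            continue
        if d_cur <= np.linalg.norm(centroids[current] - c) / 2:
            skipped += 1                      # pruned without computing d(point, c)
            continue
        d = np.linalg.norm(point - c)
        if d < best_dist:
            best, best_dist = j, d
    return best, skipped

centroids = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 12.0]])
point = np.array([1.0, 1.0])
print(assign_with_pruning(point, centroids, current=0))   # stays in cluster 0, 2 pruned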

Sampling Another approach to reducing the time complexity is to sample. In this approach, a sample of points is taken, these points are clustered, and then the remaining points are assigned to the existing clusters, typically to the closest cluster. If the number of points sampled is \sqrt{m}, then the time complexity of an O(m^2) algorithm is reduced to O(m). A key problem with sampling, though, is that small clusters can be lost. When we discuss CURE, we will provide a technique for investigating how frequently such problems occur.

PartitioningtheDataObjectsAnothercommonapproachtoreducingtimecomplexityistousesomeefficienttechniquetopartitionthedataintodisjointsetsandthenclusterthesesetsseparately.Thefinalsetofclusterseitheristheunionoftheseseparatesetsofclustersorisobtainedbycombiningand/orrefiningtheseparatesetsofclusters.WeonlydiscussbisectingK-means(Section7.2.3 )inthissection,althoughmanyotherapproachesbasedonpartitioningarepossible.Onesuchapproachwillbedescribed,whenwedescribeCURElateroninthissection.

IfK-meansisusedtofindKclusters,thenthedistanceofeachpointtoeachclustercentroidiscalculatedateachiteration.WhenKislarge,thiscanbeveryexpensive.BisectingK-meansstartswiththeentiresetofpointsandusesK-meanstorepeatedlybisectanexistingclusteruntilwehaveobtainedKclusters.Ateachstep,thedistanceofpointstotwoclustercentroidsiscomputed.Exceptforthefirststep,inwhichtheclusterbeingbisectedconsistsofallthepoints,weonlycomputethedistanceofasubsetofpointstothetwocentroidsbeingconsidered.Becauseofthisfact,bisectingK-meanscanrunsignificantlyfasterthanregularK-means.

SummarizationAnotherapproachtoclusteringistosummarizethedata,typicallyinasinglepass,andthenclusterthesummarizeddata.Inparticular,theleaderalgorithm(seeExercise12 onpage605)eitherputsadataobjectintheclosestcluster(ifthatclusterissufficientlyclose)orstartsanewclusterthatcontainsthecurrentobject.Thisalgorithmislinearinthenumberofobjectsandcanbeusedtosummarizethedatasothatotherclusteringtechniquescanbeused.TheBIRCHalgorithmusesasimilarconcept.
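A minimal sketch of the leader algorithm mentioned above (the threshold value and function name are arbitrary choices): each point either joins the nearest existing leader, if that leader is close enough, or becomes a new leader. A single pass over the data suffices, so the cost is linear in the number of objects.

import numpy as np

def leader(points, threshold):
    """One-pass leader algorithm sketch: assign each point to the closest existing
    leader if it is within `threshold`; otherwise the point starts a new cluster."""
    leaders, labels = [], []
    for p in points:
        if leaders:
            d = np.linalg.norm(np.asarray(leaders) - p, axis=1)
            j = int(np.argmin(d))
            if d[j] <= threshold:
                labels.append(j)
                continue
        leaders.append(p)                 # current point becomes a new leader
        labels.append(len(leaders) - 1)
    return np.array(labels), np.array(leaders)

rng = np.random.default_rng(8)
data = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])
labels, leaders = leader(data, threshold=1.5)
print(len(leaders), np.bincount(labels))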

ParallelandDistributedComputationIfitisnotpossibletotakeadvantageofthetechniquesdescribedearlier,oriftheseapproachesdonotyieldthedesiredaccuracyorreductionincomputationtime,thenotherapproachesareneeded.Ahighlyeffectiveapproachistodistributethecomputationamongmultipleprocessors.

8.5.2BIRCH

BIRCH(BalancedIterativeReducingandClusteringusingHierarchies)isahighlyefficientclusteringtechniquefordatainEuclideanvectorspaces,i.e.,dataforwhichaveragesmakesense.BIRCHcanefficientlyclustersuchdatawithonepassandcanimprovethatclusteringwithadditionalpasses.BIRCHcanalsodealeffectivelywithoutliers.

BIRCHisbasedonthenotionofaClusteringFeature(CF)andaCFtree.Theideaisthataclusterofdatapoints(vectors)canberepresentedbyatripleofnumbers(N,LS,SS),whereNisthenumberofpointsinthecluster,LSisthelinearsumofthepoints,andSSisthesumofsquaresofthepoints.Thesearecommonstatisticalquantitiesthatcanbeupdatedincrementallyandthatcanbeusedtocomputeanumberofimportantquantities,suchasthecentroidofaclusteranditsvariance(standarddeviation).Thevarianceisusedasameasureofthediameterofacluster.

These quantities can also be used to compute the distance between clusters. The simplest approach is to calculate an L1 (city block) or L2 (Euclidean) distance between centroids. We can also use the diameter (variance) of the merged cluster as a distance. A number of different distance measures for clusters are defined by BIRCH, but all can be computed using the summary statistics.
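The clustering feature triple and the quantities derived from it can be sketched as follows; this shows only the incremental (N, LS, SS) bookkeeping, a variance-style spread measure, and a centroid distance, not the CF tree itself, and the class name is illustrative.

import numpy as np

class ClusteringFeature:
    """CF = (N, LS, SS): point count, linear sum, and sum of squares.
    BIRCH's cluster statistics can all be derived from these three quantities."""
    def __init__(self, dim):
        self.n, self.ls, self.ss = 0, np.zeros(dim), 0.0

    def add(self, x):                      # incremental update for one point
        self.n += 1
        self.ls += x
        self.ss += float(np.dot(x, x))

    def centroid(self):
        return self.ls / self.n

    def variance(self):
        # Average squared deviation from the centroid, usable as a diameter-like measure.
        return self.ss / self.n - float(np.dot(self.centroid(), self.centroid()))

    def centroid_distance(self, other):    # simple L2 distance between centroids
        return float(np.linalg.norm(self.centroid() - other.centroid()))

rng = np.random.default_rng(9)
cf1, cf2 = ClusteringFeature(2), ClusteringFeature(2)
for x in rng.normal(0, 1, (100, 2)):
    cf1.add(x)
for x in rng.normal(5, 1, (100, 2)):
    cf2.add(x)
print(cf1.centroid(), cf1.variance(), cf1.centroid_distance(cf2))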

A CF tree is a height-balanced tree. Each interior node has entries of the form [CF_i, child_i], where child_i is a pointer to the i-th child node. The space that each entry takes and the page size determine the number of entries in an interior node. The space of each entry is, in turn, determined by the number of attributes of each point.

Leaf nodes consist of a sequence of clustering features, CF_i, where each clustering feature represents a number of points that have been previously scanned. Leaf nodes are subject to the restriction that each leaf node must have a diameter that is less than a parameterized threshold, T. The space that each entry takes, together with the page size, determines the number of entries in a leaf.

ByadjustingthethresholdparameterT,theheightofthetreecanbecontrolled.Tcontrolsthefinenessoftheclustering,i.e.,theextenttowhichthedataintheoriginalsetofdataisreduced.ThegoalistokeeptheCFtreeinmainmemorybyadjustingtheTparameterasnecessary.

ACFtreeisbuiltasthedataisscanned.Aseachdatapointisencountered,theCFtreeistraversed,startingfromtherootandchoosingtheclosestnodeateachlevel.Whentheclosestleafclusterforthecurrentdatapointisfinallyidentified,atestisperformedtoseeifaddingthedataitemtothecandidateclusterwillresultinanewclusterwithadiametergreaterthanthegiventhreshold,T.Ifnot,thenthedatapointisaddedtothecandidateclusterbyupdatingtheCFinformation.Theclusterinformationforallnodesfromtheleaftotherootisalsoupdated.

If the new cluster has a diameter greater than T, then a new entry is created if the leaf node is not full. Otherwise the leaf node must be split. The two entries (clusters) that are farthest apart are selected as seeds and the remaining entries are distributed to one of the two new leaf nodes, based on which leaf node contains the closest seed cluster. Once the leaf node has been split, the parent node is updated and split if necessary; i.e., if the parent node is full. This process may continue all the way to the root node.

BIRCHfollowseachsplitwithamergestep.Attheinteriornodewherethesplitstops,thetwoclosestentriesarefound.Iftheseentriesdonotcorrespondtothetwoentriesthatjustresultedfromthesplit,thenanattemptismadetomergetheseentriesandtheircorrespondingchildnodes.Thisstepisintendedtoincreasespaceutilizationandavoidproblemswithskeweddatainputorder.

BIRCHalsohasaprocedureforremovingoutliers.Whenthetreeneedstoberebuiltbecauseithasrunoutofmemory,thenoutlierscanoptionallybewrittentodisk.(Anoutlierisdefinedtobeanodethathasfarfewerdatapointsthanaverage.)Atcertainpointsintheprocess,outliersarescannedtoseeiftheycanbeabsorbedbackintothetreewithoutcausingthetreetogrowinsize.Ifso,theyarereabsorbed.Ifnot,theyaredeleted.

BIRCHconsistsofanumberofphasesbeyondtheinitialcreationoftheCFtree.AllthephasesofBIRCHaredescribedbrieflyinAlgorithm8.14 .

Algorithm 8.14 BIRCH.
1: Load the data into memory by creating a CF tree that summarizes the data.
2: Build a smaller CF tree if it is necessary for phase 3. T is increased, and then the leaf node entries (clusters) are reinserted. Since T has increased, some clusters will be merged.
3: Perform global clustering. Different forms of global clustering (clustering that uses the pairwise distances between all the clusters) can be used. However, an agglomerative, hierarchical technique was selected. Because the clustering features store summary information that is important to certain kinds of clustering, the global clustering algorithm can be applied as if it were being applied to all the points in a cluster represented by the CF.
4: Redistribute the data points using the centroids of clusters discovered in step 3, and thus, discover a new set of clusters. This overcomes certain problems that can occur in the first phase of BIRCH. Because of page size constraints and the T parameter, points that should be in one cluster are sometimes split, and points that should be in different clusters are sometimes combined. Also, if the data set contains duplicate points, these points can sometimes be clustered differently, depending on the order in which they are encountered. By repeating this phase multiple times, the process converges to a locally optimal solution.

8.5.3 CURE

CURE (Clustering Using REpresentatives) is a clustering algorithm that uses a variety of different techniques to create an approach that can handle large data sets, outliers, and clusters with non-spherical shapes and non-uniform sizes. CURE represents a cluster by using multiple representative points from the cluster. These points will, in theory, capture the geometry and shape of the cluster. The first representative point is chosen to be the point farthest from the center of the cluster, while the remaining points are chosen so that they are farthest from all the previously chosen points. In this way, the representative points are naturally relatively well distributed. The number of points chosen is a parameter, but it was found that a value of 10 or more worked well.

Once the representative points are chosen, they are shrunk toward the center by a factor, α. This helps moderate the effect of outliers, which are usually farther away from the center and thus, are shrunk more. For example, a representative point that was a distance of 10 units from the center would move by 3 units (for α = 0.7), while a representative point at a distance of 1 unit would only move 0.3 units.

CURE uses an agglomerative hierarchical scheme to perform the actual clustering. The distance between two clusters is the minimum distance between any two representative points (after they are shrunk toward their respective centers). While this scheme is not exactly like any other hierarchical scheme that we have seen, it is equivalent to centroid-based hierarchical clustering if α = 0, and roughly the same as single link hierarchical clustering if α = 1. Notice that while a hierarchical clustering scheme is used, the goal of CURE is to find a given number of clusters as specified by the user.

CURE takes advantage of certain characteristics of the hierarchical clustering process to eliminate outliers at two different points in the clustering process. First, if a cluster is growing slowly, then it may consist of outliers, since by definition, outliers are far from others and will not be merged with other points very often. In CURE, this first phase of outlier elimination typically occurs when the number of clusters is 1/3 the original number of points. The second phase of outlier elimination occurs when the number of clusters is on the order of K, the number of desired clusters. At this point, small clusters are again eliminated.
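The following Python fragment is a rough sketch, under the convention used above (a representative point at distance d from the center ends up at distance αd), of how well-scattered representative points might be chosen for a single cluster and then shrunk; it is illustrative only, not CURE's implementation.

import numpy as np

def representative_points(points, num_rep=10, alpha=0.7):
    centroid = points.mean(axis=0)
    # First representative: the point farthest from the centroid.
    reps = [points[np.linalg.norm(points - centroid, axis=1).argmax()]]
    # Each subsequent representative is farthest from those already chosen.
    while len(reps) < min(num_rep, len(points)):
        d = np.min(
            [np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[d.argmax()])
    reps = np.array(reps, dtype=float)
    # Shrink toward the centroid: a representative at distance d moves by
    # (1 - alpha) * d, i.e., it ends up at distance alpha * d.
    return centroid + alpha * (reps - centroid)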

Because the worst-case complexity of CURE is O(m² log m), it cannot be applied directly to large data sets. For this reason, CURE uses two techniques to speed up the clustering process. The first technique takes a random sample and performs hierarchical clustering on the sampled data points. This is followed by a final pass that assigns each remaining point in the data set to one of the clusters by choosing the cluster with the closest representative point. We discuss CURE's sampling approach in more detail later.

In some cases, the sample required for clustering is still too large and a second additional technique is required. In this situation, CURE partitions the sample data and then clusters the points in each partition. This preclustering step is then followed by a clustering of the intermediate clusters and a final pass that assigns each point in the data set to one of the clusters. CURE's partitioning scheme is also discussed in more detail later.

Algorithm 8.15 summarizes CURE. Note that K is the desired number of clusters, m is the number of points, p is the number of partitions, and q is the desired reduction of points in a partition, i.e., the number of clusters in a partition is m/(pq). Therefore, the total number of clusters is m/q. For example, if m = 10,000, p = 10, and q = 100, then each partition contains 10,000/10 = 1000 points, and there would be 1000/100 = 10 clusters in each partition and 10,000/100 = 100 clusters overall.

Algorithm 8.15 CURE.

1: Draw a random sample from the data set. The CURE paper is notable for explicitly deriving a formula for what the size of this sample should be in order to guarantee, with high probability, that all clusters are represented by a minimum number of points.
2: Partition the sample into p equal-sized partitions.
3: Cluster the points in each partition into m/(pq) clusters using CURE's hierarchical clustering algorithm to obtain a total of m/q clusters. Note that some outlier elimination occurs during this process.
4: Use CURE's hierarchical clustering algorithm to cluster the m/q clusters found in the previous step until only K clusters remain.
5: Eliminate outliers. This is the second phase of outlier elimination.
6: Assign all remaining data points to the nearest cluster to obtain a complete clustering.

Sampling in CURE  A key issue in using sampling is whether the sample is representative, that is, whether it captures the characteristics of interest. For clustering, the issue is whether we can find the same clusters in the sample as in the entire set of objects. Ideally, we would like the sample to contain some objects for each cluster and for there to be a separate cluster in the sample for those objects that belong to separate clusters in the entire data set.

A more concrete and attainable goal is to guarantee (with a high probability) that we have at least some points from each cluster. The number of points required for such a sample varies from one data set to another and depends on the number of objects and the sizes of the clusters. The creators of CURE derived a bound for the sample size that would be needed to ensure (with high probability) that we obtain at least a certain number of points from a cluster. Using the notation of this book, this bound is given by the following theorem.

Theorem 8.1. Let f be a fraction, 0 ≤ f ≤ 1. For cluster C_i of size m_i, we will obtain at least f·m_i objects from cluster C_i with a probability of 1 − δ, 0 ≤ δ ≤ 1, if our sample size s is given by the following:

s = f·m + (m/m_i)·log(1/δ) + (m/m_i)·√( (log(1/δ))² + 2·f·m_i·log(1/δ) ),    (8.21)

where m is the number of objects.

While this expression might look intimidating, it is reasonably easy to use. Suppose that there are 100,000 objects and that the goal is to have an 80% chance of obtaining 10% of the objects in cluster C_i, which has a size of 1000. In this case, f = 0.1, δ = 0.2, m = 100,000, m_i = 1000, and thus s = 11,962. If the goal is a 5% sample of C_i, which is 50 objects, then a sample size of 6440 will suffice.

Again, CURE uses sampling in the following way. First a sample is drawn, and then CURE is used to cluster this sample. After clusters have been found, each unclustered point is assigned to the closest cluster.
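As a quick numerical check of Equation 8.21 (natural logarithms assumed), the bound can be evaluated directly:

from math import log, sqrt

def cure_sample_size(m, m_i, f, delta):
    # Equation 8.21: sample size guaranteeing, with probability 1 - delta,
    # at least f * m_i points from a cluster of size m_i.
    t = log(1.0 / delta)
    return f * m + (m / m_i) * t + (m / m_i) * sqrt(t * t + 2 * f * m_i * t)

print(round(cure_sample_size(100000, 1000, 0.10, 0.2)))  # about 11,962
print(round(cure_sample_size(100000, 1000, 0.05, 0.2)))  # about 6,440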

Partitioning  When sampling is not enough, CURE also uses a partitioning approach. The idea is to divide the points into p groups of size m/p and to use CURE to cluster each partition in order to reduce the number of objects by a factor of q > 1, where q can be roughly thought of as the average size of a cluster in a partition. Overall, m/q clusters are produced. (Note that since CURE represents each cluster by a number of representative points, the reduction in the number of objects is not q.) This preclustering step is then followed by a final clustering of the m/q intermediate clusters to produce the desired number of clusters (K). Both clustering passes use CURE's hierarchical clustering algorithm and are followed by a final pass that assigns each point in the data set to one of the clusters.

The key issue is how p and q should be chosen. Algorithms such as CURE have a time complexity of O(m²) or higher, and furthermore, require that all the data be in main memory. We therefore want to choose p small enough so that an entire partition can be processed in main memory and in a 'reasonable' amount of time. At the current time, a typical desktop computer can perform a hierarchical clustering of a few thousand objects in a few seconds.

Another factor for choosing p, and also q, concerns the quality of the clustering. Specifically, the objective is to choose the values of p and q such that objects from the same underlying cluster end up in the same clusters eventually. To illustrate, suppose there are 1000 objects and a cluster of size 100. If we randomly generate 100 partitions, then each partition will, on average, have only one point from our cluster. These points will likely be put in clusters with points from other clusters or will be discarded as outliers. If we generate only 10 partitions of 100 objects, but q is 50, then the 10 points from each cluster (on average) will likely still be combined with points from other clusters, because there are only (on average) 10 points per cluster and we need to produce, for each partition, two clusters. To avoid this last problem, which concerns the proper choice of q, a suggested strategy is not to combine clusters if they are too dissimilar.

8.6WhichClusteringAlgorithm?Avarietyoffactorsneedtobeconsideredwhendecidingwhichtypeofclusteringtechniquetouse.Many,ifnotall,ofthesefactorshavebeendiscussedtosomeextentinthecurrentandpreviouschapters.Ourgoalinthissectionistosuccinctlysummarizethesefactorsinawaythatshedssomelightonwhichclusteringalgorithmmightbeappropriateforaparticularclusteringtask.

TypeofClusteringForaclusteringalgorithmtobeappropriateforatask,thetypeofclusteringproducedbythealgorithmneedstomatchthetypeofclusteringneededbytheapplication.Forsomeapplications,suchascreatingabiologicaltaxonomy,ahierarchyispreferred.Inthecaseofclusteringforsummarization,apartitionalclusteringistypical.Inyetotherapplications,bothcanproveuseful.

Mostclusteringapplicationsrequireaclusteringofall(oralmostall)oftheobjects.Forinstance,ifclusteringisusedtoorganizeasetofdocumentsforbrowsing,thenwewouldlikemostdocumentstobelongtoagroup.However,ifwewantedtofindthestrongestthemesinasetofdocuments,thenwemightprefertohaveaclusteringschemethatproducesonlyverycohesiveclusters,evenifmanydocumentswereleftunclustered.

Finally,mostapplicationsofclusteringassumethateachobjectisassignedtoonecluster(oroneclusteronalevelforhierarchicalschemes).Aswehaveseen,however,probabilisticandfuzzyschemesprovideweightsthatindicatethedegreeorprobabilityofmembershipinvariousclusters.Othertechniques,suchasDBSCANandSNNdensity-basedclustering,havethenotionofcore

points,whichstronglybelongtoonecluster.Suchconceptsmaybeusefulincertainapplications.

TypeofClusterAnotherkeyaspectiswhetherthetypeofclustermatchestheintendedapplication.Therearethreecommonlyencounteredtypesofclusters:prototype-,graph-,anddensity-based.Prototype-basedclusteringschemes,aswellassomegraph-basedclusteringschemes—completelink,centroid,andWard’s—tendtoproduceglobularclustersinwhicheachobjectisclosetothecluster’sprototypeand/ortotheotherobjectsinthecluster.If,forexample,wewanttosummarizethedatatoreduceitssizeandwewanttodosowiththeminimumamountoferror,thenoneofthesetypesoftechniqueswouldbemostappropriate.Incontrast,density-basedclusteringtechniques,aswellassomegraph-basedclusteringtechniques,suchassinglelink,tendtoproduceclustersthatarenotglobularandthuscontainmanyobjectsthatarenotverysimilartooneanother.Ifclusteringisusedtosegmentageographicalareaintocontiguousregionsbasedonthetypeoflandcover,thenoneofthesetechniquesismoresuitablethanaprototype-basedschemesuchasK-means.

CharacteristicsofClustersBesidesthegeneraltypeofcluster,otherclustercharacteristicsareimportant.Ifwewanttofindclustersinsubspacesoftheoriginaldataspace,thenwemustchooseanalgorithmsuchasCLIQUE,whichexplicitlylooksforsuchclusters.Similarly,ifweareinterestedinenforcingspatialrelationshipsbetweenclusters,thenSOMorsomerelatedapproachwouldbeappropriate.Also,clusteringalgorithmsdifferwidelyintheirabilitytohandleclustersofvaryingshapes,sizes,anddensities.

CharacteristicsoftheDataSetsandAttributesAsdiscussedintheintroduction,thetypeofdatasetandattributescandictatethetypeofalgorithmtouse.Forinstance,theK-meansalgorithmcanonlybeusedondataforwhichanappropriateproximitymeasureisavailablethatallows

meaningfulcomputationofaclustercentroid.Forotherclusteringtechniques,suchasmanyagglomerativehierarchicalapproaches,theunderlyingnatureofthedatasetsandattributesislessimportantaslongasaproximitymatrixcanbecreated.

NoiseandOutliersNoiseandoutliersareparticularlyimportantaspectsofthedata.Wehavetriedtoindicatetheeffectofnoiseandoutliersonthevariousclusteringalgorithmsthatwehavediscussed.Inpractice,however,itcanbedifficulttoevaluatetheamountofnoiseinthedatasetorthenumberofoutliers.Morethanthat,whatisnoiseoranoutliertoonepersonmightbeinterestingtoanotherperson.Forexample,ifweareusingclusteringtosegmentanareaintoregionsofdifferentpopulationdensity,wedonotwanttouseadensity-basedclusteringtechnique,suchasDBSCAN,thatassumesthatregionsorpointswithdensitylowerthanaglobalthresholdarenoiseoroutliers.Asanotherexample,hierarchicalclusteringschemes,suchasCURE,oftendiscardclustersofpointsthataregrowingslowlyassuchgroupstendtorepresentoutliers.However,insomeapplicationswearemostinterestedinrelativelysmallclusters;e.g.,inmarketsegmentation,suchgroupsmightrepresentthemostprofitablecustomers.

NumberofDataObjectsWehaveconsideredhowclusteringisaffectedbythenumberofdataobjectsinconsiderabledetailinprevioussections.Wereiterate,however,thatthisfactoroftenplaysanimportantroleindeterminingthetypeofclusteringalgorithmtobeused.Supposethatwewanttocreateahierarchicalclusteringofasetofdata,wearenotinterestedinacompletehierarchythatextendsallthewaytoindividualobjects,butonlytothepointatwhichwehavesplitthedataintoafewhundredclusters.Ifthedataisverylarge,wecannotdirectlyapplyanagglomerativehierarchicalclusteringtechnique.Wecould,however,useadivisiveclusteringtechnique,suchastheminimumspanningtree(MST)algorithm,whichisthedivisiveanalogtosinglelink,butthiswouldonlyworkifthedatasetisnottoolarge.BisectingK-

meanswouldalsoworkformanydatasets,butifthedatasetislargeenoughthatitcannotbecontainedcompletelyinmemory,thenthisschemealsorunsintoproblems.Inthissituation,atechniquesuchasBIRCH,whichdoesnotrequirethatalldatabeinmainmemory,becomesmoreuseful.

NumberofAttributesWehavealsodiscussedtheimpactofdimensionalityatsomelength.Again,thekeypointistorealizethatanalgorithmthatworkswellinlowormoderatedimensionsmaynotworkwellinhighdimensions.Asinmanyothercasesinwhichaclusteringalgorithmisinappropriatelyapplied,theclusteringalgorithmwillrunandproduceclusters,buttheclusterswilllikelynotrepresentthetruestructureofthedata.

ClusterDescriptionOneaspectofclusteringtechniquesthatisoftenoverlookedishowtheresultingclustersaredescribed.Prototypeclustersaresuccinctlydescribedbyasmallsetofclusterprototypes.Inthecaseofmixturemodels,theclustersaredescribedintermsofsmallsetsofparameters,suchasthemeanvectorandthecovariancematrix.Thisisalsoaverycompactandunderstandablerepresentation.ForSOM,itistypicallypossibletovisualizetherelationshipsbetweenclustersinatwo-dimensionalplot,suchasthatofFigure8.8 .Forgraph-anddensity-basedclusteringapproaches,however,clustersaretypicallydescribedassetsofclustermembers.Nonetheless,inCURE,clusterscanbedescribedbya(relatively)smallsetofrepresentativepoints.Also,forgrid-basedclusteringschemes,suchasCLIQUE,morecompactdescriptionscanbegeneratedintermsofconditionsontheattributevaluesthatdescribethegridcellsinthecluster.

AlgorithmicConsiderationsTherearealsoimportantaspectsofalgorithmsthatneedtobeconsidered.Isthealgorithmnon-deterministicororder-dependent?Doesthealgorithmautomaticallydeterminethenumberofclusters?Isthereatechniquefordeterminingthevaluesofvariousparameters?Manyclusteringalgorithmstrytosolvetheclusteringproblemby

tryingtooptimizeanobjectivefunction.Istheobjectiveagoodmatchfortheapplicationobjective?Ifnot,thenevenifthealgorithmdoesagoodjoboffindingaclusteringthatisoptimalorclosetooptimalwithrespecttotheobjectivefunction,theresultisnotmeaningful.Also,mostobjectivefunctionsgivepreferencetolargerclustersattheexpenseofsmallerclusters.

SummaryThetaskofchoosingtheproperclusteringalgorithminvolvesconsideringalloftheseissues,anddomain-specificissuesaswell.Thereisnoformulafordeterminingthepropertechnique.Nonetheless,ageneralknowledgeofthetypesofclusteringtechniquesthatareavailableandconsiderationoftheissuesmentionedabove,togetherwithafocusontheintendedapplication,shouldallowadataanalysttomakeaninformeddecisiononwhichclusteringapproach(orapproaches)totry.

8.7BibliographicNotesAnextensivediscussionoffuzzyclustering,includingadescriptionoffuzzyc-meansandformalderivationsoftheformulaspresentedinSection8.2.1 ,canbefoundinthebookonfuzzyclusteranalysisbyHöppneretal.[595].Whilenotdiscussedinthischapter,AutoClassbyCheesemanetal.[573]isoneoftheearliestandmostprominentmixture-modelclusteringprograms.AnintroductiontomixturemodelscanbefoundinthetutorialofBilmes[568],thebookbyMitchell[606](whichalsodescribeshowtheK-meansalgorithmcanbederivedfromamixturemodelapproach),andthearticlebyFraleyandRaftery[581].Mixturemodelisanexampleofaprobabilisticclusteringmethod,inwhichtheclustersarerepresentedashiddenvariablesinthemodel.MoresophisticatedprobabilisticclusteringmethodssuchaslatentDirichletallocation(LDA)[570]havebeendevelopedinrecentyearsfordomainssuchastextclustering.

Besidesdataexploration,SOManditssupervisedlearningvariant,LearningVectorQuantization(LVQ),havebeenusedformanytasks:imagesegmentation,organizationofdocumentfiles,andspeechprocessing.OurdiscussionofSOMwascastintheterminologyofprototype-basedclustering.ThebookonSOMbyKohonenetal.[601]containsanextensiveintroductiontoSOMthatemphasizesitsneuralnetworkorigins,aswellasadiscussionofsomeofitsvariationsandapplications.OneimportantSOM-relatedclusteringdevelopmentistheGenerativeTopographicMap(GTM)algorithmbyBishopetal.[569],whichusestheEMalgorithmtofindGaussianmodelssatisfyingtwo-dimensionaltopographicconstraints.

ThedescriptionofChameleoncanbefoundinthepaperbyKarypisetal.[599].Capabilitiessimilar,althoughnotidenticaltothoseofChameleonhave

beenimplementedintheCLUTOclusteringpackagebyKarypis[575].TheMETISgraphpartitioningpackagebyKarypisandKumar[600]isusedtoperformgraphpartitioninginbothprograms,aswellasintheOPOSSUMclusteringalgorithmbyStrehlandGhosh[616].AdetaileddiscussiononspectralclusteringcanbefoundinthetutorialbyvonLuxburg[618].ThespectralclusteringmethoddescribedinthischapterisbasedonanunnormalizedgraphLaplacianmatrixandtheratiocutmeasure[590].AlternativeformulationsofspectralclusteringhavebeendevelopedusingnormalizedgraphLaplacianmatricesforotherevaluationmeasures[613].

ThenotionofSNNsimilaritywasintroducedbyJarvisandPatrick[596].AhierarchicalclusteringschemebasedonasimilarconceptofmutualnearestneighborswasproposedbyGowdaandKrishna[586].Guhaetal.[589]createdROCK,ahierarchicalgraph-basedclusteringalgorithmforclusteringtransactiondata,whichamongotherinterestingfeatures,alsousesanotionofsimilaritybasedonsharedneighborsthatcloselyresemblestheSNNsimilaritydevelopedbyJarvisandPatrick.AdescriptionoftheSNNdensity-basedclusteringtechniquecanbefoundinthepublicationsofErtözetal.[578,579].SNNdensity-basedclusteringwasusedbySteinbachetal.[614]tofindclimateindices.

Examplesofgrid-basedclusteringalgorithmsareOptiGrid(HinneburgandKeim[594]),theBANGclusteringsystem(SchikutaandErhart[611]),andWaveCluster(Sheikholeslamietal.[612]).TheCLIQUEalgorithmisdescribedinthepaperbyAgrawaletal.[564].MAFIA(Nageshetal.[608])isamodificationofCLIQUEwhosegoalisimprovedefficiency.Kailingetal.[598]havedevelopedSUBCLU(density-connectedSUBspaceCLUstering),asubspaceclusteringalgorithmbasedonDBSCAN.TheDENCLUEalgorithmwasproposedbyHinneburgandKeim[593].

OurdiscussionofscalabilitywasstronglyinfluencedbythearticleofGhosh[584].Awide-rangingdiscussionofspecifictechniquesforclusteringmassivedatasetscanbefoundinthepaperbyMurtagh[607].CUREisworkbyGuhaetal.[588],whiledetailsofBIRCHareinthepaperbyZhangetal.[620].CLARANS(NgandHan[609])isanalgorithmforscalingK-medoidclusteringtolargerdatabases.AdiscussionofscalingEMandK-meansclusteringtolargedatasetsisprovidedbyBradleyetal.[571,572].AparallelimplementationofK-meansontheMapReduceframeworkhasalsobeendeveloped[621].InadditiontoK-means,otherclusteringalgorithmsthathavebeenimplementedontheMapReduceframeworkincludeDBScan[592],spectralclustering[574],andhierarchicalclustering[617].

Inadditiontotheapproachesdescribedinthischapter,therearemanyotherclusteringmethodsproposedintheliterature.Oneclassofmethodsthathasbecomeincreasinglypopularinrecentyearsisbasedonnon-negativematrixfactorization(NMF)[602].Theideaisanextensionofthesingularvaluedecomposition(SVD)approachdescribedinChapter2 ,inwhichthedatamatrixisdecomposedintoaproductoflower-rankmatricesthatrepresenttheunderlyingcomponentsorclustersinthedata.InNMF,additionalconstraintsareimposedtoensurenon-negativityintheelementsofthecomponentmatrices.Withdifferentformulationsandconstraints,theNMFmethodcanbeshowntobeequivalenttootherclusteringapproaches,includingK-meansandspectralclustering[577,603].Anotherpopularclassofmethodsutilizestheconstraintsprovidedbyuserstoguidetheclusteringalgorithm.Suchalgorithmsarecommonlyknownasconstrainedclusteringorsemi-supervisedclustering[566,567,576,619].

Therearemanyaspectsofclusteringthatwehavenotcovered.AdditionalpointersaregiveninthebooksandsurveysmentionedintheBibliographicNotesofthepreviouschapter.Here,wementionfourareas—omitting,unfortunately,manymore.Clusteringoftransactiondata(Gantietal.[582],

Gibsonetal.[585],Hanetal.[591],andPetersandZaki[610])isanimportantarea,astransactiondataiscommonandofcommercialimportance.Streamingdataisalsobecomingincreasinglycommonandimportantascommunicationsandsensornetworksbecomepervasive.TwointroductionstoclusteringfordatastreamsaregiveninarticlesbyBarbará[565]andGuhaetal.[587].Conceptualclustering(FisherandLangley[580],Jonyeretal.[597],Mishraetal.[605],MichalskiandStepp[604],SteppandMichalski[615]),whichusesmorecomplicateddefinitionsofclustersthatoftencorrespondbettertohumannotionsofacluster,isanareaofclusteringwhosepotentialhasperhapsnotbeenfullyrealized.Finally,therehasbeenagreatdealofclusteringworkfordatacompressionintheareaofvectorquantization.ThebookbyGershoandGray[583]isastandardtextinthisarea.

Bibliography[564]R.Agrawal,J.Gehrke,D.Gunopulos,andP.Raghavan.Automatic

subspaceclusteringofhighdimensionaldatafordataminingapplications.InProc.of1998ACMSIGMODIntl.Conf.onManagementofData,pages94–105,Seattle,Washington,June1998.ACMPress.

[565]D.Barbará.Requirementsforclusteringdatastreams.SIGKDDExplorationsNewsletter,3(2):23–27,2002.

[566]S.Basu,A.Banerjee,andR.Mooney.Semi-supervisedclusteringbyseeding.InProceedingsof19thInternationalConferenceonMachineLearning,pages19–26,2002.

[567]S.Basu,I.Davidson,andK.Wagstaff.ConstrainedClustering:AdvancesinAlgorithms,Theory,andApplications.CRCPress,2008.

[568]J.Bilmes.AGentleTutorialontheEMAlgorithmanditsApplicationtoParameterEstimationforGaussianMixtureandHiddenMarkovModels.TechnicalReportICSITR-97-021,UniversityofCaliforniaatBerkeley,1997.

[569]C.M.Bishop,M.Svensen,andC.K.I.Williams.GTM:Aprincipledalternativetotheself-organizingmap.InC.vonderMalsburg,W.vonSeelen,J.C.Vorbruggen,andB.Sendhoff,editors,ArtificialNeural

Networks—ICANN96.Intl.Conf,Proc.,pages165–170.Springer-Verlag,Berlin,Germany,1996.

[570]D.M.Blei,A.Y.Ng,andM.I.Jordan.LatentDirichletAllocation.JournalofMachineLearningResearch,3(4-5):993–1022,2003.

[571]P.S.Bradley,U.M.Fayyad,andC.Reina.ScalingClusteringAlgorithmstoLargeDatabases.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages9–15,NewYorkCity,August1998.AAAIPress.

[572]P.S.Bradley,U.M.Fayyad,andC.Reina.ScalingEM(ExpectationMaximization)ClusteringtoLargeDatabases.TechnicalReportMSR-TR-98-35,MicrosoftResearch,October1999.

[573]P.Cheeseman,J.Kelly,M.Self,J.Stutz,W.Taylor,andD.Freeman.AutoClass:aBayesianclassificationsystem.InReadingsinknowledgeacquisitionandlearning:automatingtheconstructionandimprovementofexpertsystems,pages431–441.MorganKaufmannPublishersInc.,1993.

[574] W. Y. Chen, Y. Song, H. Bai, C. J. Lin, and E. Y. Chang. Parallel spectral clustering in distributed systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):568–586, 2011.

[575]CLUTO2.1.2:SoftwareforClusteringHigh-DimensionalDatasets.www.cs.umn.edu/∼karypis,October2016.

[576]I.DavidsonandS.Basu.Asurveyofclusteringwithinstancelevelconstraints.ACMTransactionsonKnowledgeDiscoveryfromData,1:1–41,2007.

[577] C. Ding, X. He, and H. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proc. of the SIAM International Conference on Data Mining, pages 606–610, 2005.

[578]L.Ertöz,M.Steinbach,andV.Kumar.ANewSharedNearestNeighborClusteringAlgorithmanditsApplications.InWorkshoponClusteringHighDimensionalDataanditsApplications,Proc.ofTextMine’01,FirstSIAMIntl.Conf.onDataMining,Chicago,IL,USA,2001.

[579]L.Ertöz,M.Steinbach,andV.Kumar.FindingClustersofDifferentSizes,Shapes,andDensitiesinNoisy,HighDimensionalData.InProc.ofthe2003SIAMIntl.Conf.onDataMining,SanFrancisco,May2003.SIAM.

[580]D.FisherandP.Langley.Conceptualclusteringanditsrelationtonumericaltaxonomy.ArtificialIntelligenceandStatistics,pages77–116,1986.

[581]C.FraleyandA.E.Raftery.HowManyClusters?WhichClusteringMethod?AnswersViaModel-BasedClusterAnalysis.TheComputerJournal,41(8):578–588,1998.

[582]V.Ganti,J.Gehrke,andR.Ramakrishnan.CACTUS–ClusteringCategoricalDataUsingSummaries.InProc.ofthe5thIntl.Conf.on

KnowledgeDiscoveryandDataMining,pages73–83.ACMPress,1999.

[583]A.GershoandR.M.Gray.VectorQuantizationandSignalCompression,volume159ofKluwerInternationalSeriesinEngineeringandComputerScience.KluwerAcademicPublishers,1992.

[584]J.Ghosh.ScalableClusteringMethodsforDataMining.InN.Ye,editor,HandbookofDataMining,pages247–277.LawrenceEalbaumAssoc,2003.

[585]D.Gibson,J.M.Kleinberg,andP.Raghavan.ClusteringCategoricalData:AnApproachBasedonDynamicalSystems.VLDBJournal,8(3–4):222–236,2000.

[586]K.C.GowdaandG.Krishna.AgglomerativeClusteringUsingtheConceptofMutualNearestNeighborhood.PatternRecognition,10(2):105–112,1978.

[587]S.Guha,A.Meyerson,N.Mishra,R.Motwani,andL.O’Callaghan.ClusteringDataStreams:TheoryandPractice.IEEETransactionsonKnowledgeandDataEngineering,15(3):515–528,May/June2003.

[588]S.Guha,R.Rastogi,andK.Shim.CURE:AnEfficientClusteringAlgorithmforLargeDatabases.InProc.of1998ACM-SIGMODIntl.Conf.onManagementofData,pages73–84.ACMPress,June1998.

[589]S.Guha,R.Rastogi,andK.Shim.ROCK:ARobustClusteringAlgorithmforCategoricalAttributes.InProc.ofthe15thIntl.Conf.onDataEngineering,pages512–521.IEEEComputerSociety,March1999.

[590] L. Hagen and A. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Trans. Computer-Aided Design, 11(9):1074–1085, 1992.

[591]E.-H.Han,G.Karypis,V.Kumar,andB.Mobasher.HypergraphBasedClusteringinHigh-DimensionalDataSets:ASummaryofResults.IEEEDataEng.Bulletin,21(1):15–22,1998.

[592]Y.He,H.Tan,W.Luo,H.Mao,D.Ma,S.Feng,andJ.Fan.MR-DBSCAN:anefficientparalleldensity-basedclusteringalgorithmusingMapReduce.InProcoftheIEEEInternationalConferenceonParallelandDistributedSystems,pages473–480,2011.

[593]A.HinneburgandD.A.Keim.AnEfficientApproachtoClusteringinLargeMultimediaDatabaseswithNoise.InProc.ofthe4thIntl.Conf.onKnowledgeDiscoveryandDataMining,pages58–65,NewYorkCity,August1998.AAAIPress.

[594]A.HinneburgandD.A.Keim.OptimalGrid-Clustering:TowardsBreakingtheCurseofDimensionalityinHigh-DimensionalClustering.InProc.ofthe25thVLDBConf.,pages506–517,Edinburgh,Scotland,UK,September1999.MorganKaufmann.

[595]F.Höppner,F.Klawonn,R.Kruse,andT.Runkler.FuzzyClusterAnalysis:MethodsforClassification,DataAnalysisandImageRecognition.JohnWiley&Sons,NewYork,July21999.

[596]R.A.JarvisandE.A.Patrick.ClusteringUsingaSimilarityMeasureBasedonSharedNearestNeighbors.IEEETransactionsonComputers,C-22(11):1025–1034,1973.

[597]I.Jonyer,D.J.Cook,andL.B.Holder.Graph-basedhierarchicalconceptualclustering.JournalofMachineLearningResearch,2:19–43,2002.

[598]K.Kailing,H.-P.Kriegel,andP.Kröger.Density-ConnectedSubspaceClusteringforHigh-DimensionalData.InProc.ofthe2004SIAMIntl.Conf.onDataMining,pages428–439,LakeBuenaVista,Florida,April2004.SIAM.

[599]G.Karypis,E.-H.Han,andV.Kumar.CHAMELEON:AHierarchicalClusteringAlgorithmUsingDynamicModeling.IEEEComputer,32(8):68–75,August1999.

[600]G.KarypisandV.Kumar.Multilevelk-wayPartitioningSchemeforIrregularGraphs.JournalofParallelandDistributedComputing,48(1):96–129,1998.

[601]T.Kohonen,T.S.Huang,andM.R.Schroeder.Self-OrganizingMaps.Springer-Verlag,December2000.

[602] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

[603]T.LiandC.H.Q.Ding.TheRelationshipsAmongVariousNonnegativeMatrixFactorizationMethodsforClustering.InProcoftheIEEEInternationalConferenceonDataMining,pages362–371,2006.

[604]R.S.MichalskiandR.E.Stepp.AutomatedConstructionofClassifications:ConceptualClusteringVersusNumericalTaxonomy.IEEETransactionsonPatternAnalysisandMachineIntelligence,5(4):396–409,1983.

[605]N.Mishra,D.Ron,andR.Swaminathan.ANewConceptualClusteringFramework.MachineLearningJournal,56(1–3):115–151,July/August/September2004.

[606]T.Mitchell.MachineLearning.McGraw-Hill,Boston,MA,1997.

[607]F.Murtagh.Clusteringmassivedatasets.InJ.Abello,P.M.Pardalos,andM.G.C.Reisende,editors,HandbookofMassiveDataSets.Kluwer,2000.

[608]H.Nagesh,S.Goil,andA.Choudhary.ParallelAlgorithmsforClusteringHigh-DimensionalLarge-ScaleDatasets.InR.L.Grossman,C.Kamath,P.Kegelmeyer,V.Kumar,andR.Namburu,editors,DataMiningforScientificandEngineeringApplications,pages335–356.KluwerAcademicPublishers,Dordrecht,Netherlands,October2001.

[609]R.T.NgandJ.Han.CLARANS:AMethodforClusteringObjectsforSpatialDataMining.IEEETransactionsonKnowledgeandDataEngineering,14(5):1003–1016,2002.

[610]M.PetersandM.J.Zaki.CLICKS:ClusteringCategoricalDatausingK-partiteMaximalCliques.InProc.ofthe21stIntl.Conf.onDataEngineering,Tokyo,Japan,April2005.

[611]E.SchikutaandM.Erhart.TheBANG-ClusteringSystem:Grid-BasedDataAnalysis.InAdvancesinIntelligentDataAnalysis,ReasoningaboutData,SecondIntl.Symposium,IDA-97,London,volume1280ofLectureNotesinComputerScience,pages513–524.Springer,August1997.

[612]G.Sheikholeslami,S.Chatterjee,andA.Zhang.Wavecluster:Amulti-resolutionclusteringapproachforverylargespatialdatabases.InProc.ofthe24thVLDBConf.,pages428–439,NewYorkCity,August1998.MorganKaufmann.

[613] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[614]M.Steinbach,P.-N.Tan,V.Kumar,S.Klooster,andC.Potter.Discoveryofclimateindicesusingclustering.InKDD’03:ProceedingsoftheninthACMSIGKDDinternationalconferenceonKnowledgediscoveryanddatamining,pages446–455,NewYork,NY,USA,2003.ACMPress.

[615]R.E.SteppandR.S.Michalski.Conceptualclusteringofstructuredobjects:Agoal-orientedapproach.ArtificialIntelligence,28(1):43–69,1986.

[616]A.StrehlandJ.Ghosh.AScalableApproachtoBalanced,High-dimensionalClusteringofMarket-Baskets.InProc.ofthe7thIntl.Conf.onHighPerformanceComputing(HiPC2000),volume1970ofLectureNotesinComputerScience,pages525–536,Bangalore,India,December2000.Springer.

[617]T.Sun,C.Shu,F.Li,H.Yu,L.Ma,andY.Fang.Anefficienthierarchicalclusteringmethodforlargedatasetswithmap-reduce.InProcoftheIEEEInternationalConferenceonParallelandDistributedComputing,ApplicationsandTechnologies,pages494–499,2009.

[618]U.vonLuxburg.Atutorialonspectralclustering.StatisticsandComputing,17(4):395–416,2007.

[619]K.Wagstaff,C.Cardie,S.Rogers,andS.Schroedl.ConstrainedK-meansClusteringwithBackgroundKnowledge.InProceedingsof18thInternationalConferenceonMachineLearning,pages577–584,2001.

[620]T.Zhang,R.Ramakrishnan,andM.Livny.BIRCH:anefficientdataclusteringmethodforverylargedatabases.InProc.of1996ACM-SIGMODIntl.Conf.onManagementofData,pages103–114,Montreal,Quebec,Canada,June1996.ACMPress.

[621] W. Zhao, H. Ma, and Q. He. Parallel K-Means Clustering based on MapReduce. In Proc. of the IEEE International Conference on Cloud Computing, pages 674–679, 2009.

8.8Exercises1.Forsparsedata,discusswhyconsideringonlythepresenceofnon-zerovaluesmightgiveamoreaccurateviewoftheobjectsthanconsideringtheactualmagnitudesofvalues.Whenwouldsuchanapproachnotbedesirable?

2.DescribethechangeinthetimecomplexityofK-meansasthenumberofclusterstobefoundincreases.

3. Consider a set of documents. Assume that all documents have been normalized to have unit length of 1. What is the "shape" of a cluster that consists of all documents whose cosine similarity to a centroid is greater than some specified constant? In other words, cos(d, c) ≥ δ, where 0 < δ ≤ 1.

4.Discusstheadvantagesanddisadvantagesoftreatingclusteringasanoptimizationproblem.Amongotherfactors,considerefficiency,non-determinism,andwhetheranoptimization-basedapproachcapturesalltypesofclusteringsthatareofinterest.

5.Whatisthetimeandspacecomplexityoffuzzyc-means?OfSOM?HowdothesecomplexitiescomparetothoseofK-means?

6.TraditionalK-meanshasanumberoflimitations,suchassensitivitytooutliersanddifficultyinhandlingclustersofdifferentsizesanddensities,orwithnon-globularshapes.Commentontheabilityoffuzzyc-meanstohandlethesesituations.

7.Forthefuzzyc-meansalgorithmdescribedinthisbook,thesumofthemembershipdegreeofanypointoverallclustersis1.Instead,wecouldonlyrequirethatthemembershipdegreeofapointinaclusterbebetween0and1.Whataretheadvantagesanddisadvantagesofsuchanapproach?


8.Explainthedifferencebetweenlikelihoodandprobability.

9.Equation8.12 givesthelikelihoodforasetofpointsfromaGaussiandistributionasafunctionofthemeanμandthestandarddeviationσ.Showmathematicallythatthemaximumlikelihoodestimateofμandσarethesamplemeanandthesamplestandarddeviation,respectively.

10.Wetakeasampleofadultsandmeasuretheirheights.Ifwerecordthegenderofeachperson,wecancalculatetheaverageheightandthevarianceoftheheight,separately,formenandwomen.Suppose,however,thatthisinformationwasnotrecorded.Woulditbepossibletostillobtainthisinformation?Explain.

11.ComparethemembershipweightsandprobabilitiesofFigures8.1 and8.4 ,whichcome,respectively,fromapplyingfuzzyandEMclusteringtothesamesetofdatapoints.Whatdifferencesdoyoudetect,andhowmightyouexplainthesedifferences?

12.Figure8.32 showsaclusteringofatwo-dimensionalpointdatasetwithtwoclusters:Theleftmostcluster,whosepointsaremarkedbyasterisks,issomewhatdiffuse,whiletherightmostcluster,whosepointsaremarkedbycircles,iscompact.Totherightofthecompactcluster,thereisasinglepoint(markedbyanarrow)thatbelongstothediffusecluster,whosecenterisfartherawaythanthatofthecompactcluster.ExplainwhythisispossiblewithEMclustering,butnotK-meansclustering.

Figure8.32.DatasetforExercise12 .EMclusteringofatwo-dimensionalpointsetwithtwoclustersofdifferingdensity.

13.ShowthattheMSTclusteringtechniqueofSection8.4.2 producesthesameclustersassinglelink.Toavoidcomplicationsandspecialcases,assumethatallthepairwisesimilaritiesaredistinct.

14.Onewaytosparsifyaproximitymatrixisthefollowing:Foreachobject(rowinthematrix),setallentriesto0exceptforthosecorrespondingtotheobjectsk-nearestneighbors.However,thesparsifiedproximitymatrixistypicallynotsymmetric.

a. Ifobjectaisamongthek-nearestneighborsofobjectb,whyisbnotguaranteedtobeamongthek-nearestneighborsofa?

b. Suggestatleasttwoapproachesthatcouldbeusedtomakethesparsifiedproximitymatrixsymmetric.

15.Giveanexampleofasetofclustersinwhichmergingbasedontheclosenessofclustersleadstoamorenaturalsetofclustersthanmergingbasedonthestrengthofconnection(interconnectedness)ofclusters.

16.Table8.4 liststhetwonearestneighborsoffourpoints.CalculatetheSNNsimilaritybetweeneachpairofpointsusingthedefinitionofSNNsimilaritydefinedinAlgorithm8.11 .

Table8.4.Twonearestneighborsoffourpoints.

Point FirstNeighbor SecondNeighbor

1 4 3

2 3 4

3 4 2

4 3 1

17.ForthedefinitionofSNNsimilarityprovidedbyAlgorithm8.11 ,thecalculationofSNNdistancedoesnottakeintoaccountthepositionofsharedneighborsinthetwonearestneighborlists.Inotherwords,itmightbedesirabletogivehighersimilaritytotwopointsthatsharethesamenearestneighborsinthesameorroughlythesameorder.

a. DescribehowyoumightmodifythedefinitionofSNNsimilaritytogivehighersimilaritytopointswhosesharedneighborsareinroughlythesameorder.

b. Discusstheadvantagesanddisadvantagesofsuchamodification.

18.NameatleastonesituationinwhichyouwouldnotwanttouseclusteringbasedonSNNsimilarityordensity.

19.Grid-clusteringtechniquesaredifferentfromotherclusteringtechniquesinthattheypartitionspaceinsteadofsetsofpoints.

a. Howdoesthisaffectsuchtechniquesintermsofthedescriptionoftheresultingclustersandthetypesofclustersthatcanbefound?

b. Whatkindofclustercanbefoundwithgrid-basedclustersthatcannotbefoundbyothertypesofclusteringapproaches?(Hint:SeeExercise20inChapter7 ,page608.)

20.InCLIQUE,thethresholdusedtofindclusterdensityremainsconstant,evenasthenumberofdimensionsincreases.Thisisapotentialproblembecausedensitydropsasdimensionalityincreases;i.e.,tofindclustersinhigherdimensionsthethresholdhastobesetatalevelthatmaywellresultinthemergingoflow-dimensionalclusters.Commentonwhetheryoufeelthisistrulyaproblemand,ifso,howyoumightmodifyCLIQUEtoaddressthisproblem.

21.GivenasetofpointsinEuclideanspace,whicharebeingclusteredusingtheK-meansalgorithmwithEuclideandistance,thetriangleinequalitycanbeusedintheassignmentsteptoavoidcalculatingallthedistancesofeachpointtoeachclustercentroid.Provideageneraldiscussionofhowthismightwork.

22.InsteadofusingtheformuladerivedinCURE—seeEquation8.21 —wecouldrunaMonteCarlosimulationtodirectlyestimatetheprobabilitythatasampleofsizeswouldcontainatleastacertainfractionofthepointsfromacluster.UsingaMonteCarlosimulationcomputetheprobabilitythatasampleofsizescontains50%oftheelementsofaclusterofsize100,wherethetotalnumberofpointsis1000,andwherescantakethevalues100,200,or500.

9AnomalyDetection

Inanomalydetection,thegoalistofindobjectsthatdonotconformtonormalpatternsorbehavior.Often,anomalousobjectsareknownasoutliers,since,onascatterplotofthedata,theyliefarawayfromotherdatapoints.Anomalydetectionisalsoknownasdeviationdetection,becauseanomalousobjectshaveattributevaluesthatdeviatesignificantlyfromtheexpectedortypicalattributevalues,orasexceptionmining,becauseanomaliesareexceptionalinsomesense.Inthischapter,wewillmostlyusethetermsanomalyoroutlier.Thereareavarietyofanomalydetectionapproachesfromseveralareas,includingstatistics,machinelearning,anddatamining.Alltrytocapturetheideathatananomalousdataobjectisunusualorinsomewayinconsistentwithotherobjects.

Although unusual objects or events are, by definition, relatively rare, their detection and analysis provides critical insights that are useful in a number of applications. The following examples illustrate applications for which anomalies are of considerable interest.

FraudDetection.Thepurchasingbehaviorofsomeonewhostealsacreditcardisoftendifferentfromthatoftheoriginalowner.Creditcardcompaniesattempttodetectatheftbylookingforbuyingpatternsthatcharacterizetheftorbynoticingachangefromtypicalbehavior.Similarapproachesarerelevantinmanydomainssuchasdetectinginsuranceclaimfraudandinsidertrading.IntrusionDetection.Unfortunately,attacksoncomputersystemsandcomputernetworksarecommonplace.Whilesomeoftheseattacks,suchasthosedesignedtodisableoroverwhelmcomputersandnetworks,areobvious,otherattacks,suchasthosedesignedtosecretlygatherinformation,aredifficulttodetect.Manyoftheseintrusionscanonlybedetectedbymonitoringsystemsandnetworksforunusualbehavior.EcosystemDisturbances.TheEarth’secosystemhasbeenexperiencingrapidchangesinthelastfewdecadesduetonaturaloranthropogenicreasons.Thisincludesanincreasedpropensityforextremeevents,suchasheatwaves,droughts,andfloods,whichhaveahugeimpactontheenvironment.Identifyingsuchextremeeventsfromsensorrecordingsandsatelliteimagesisimportantforunderstandingtheiroriginsandbehavior,aswellasfordevisingsustainableadaptationpolicies.MedicineandPublicHealth.Foraparticularpatient,unusualsymptomsortestresults,suchasananomalousMRIscan,mayindicatepotentialhealthproblems.However,whetheraparticulartest

resultisanomalousmaydependonmanyothercharacteristicsofthepatient,suchasage,sex,andgeneticmakeup.Furthermore,thecategorizationofaresultasanomalousornotincursacost—unneededadditionaltestsifapatientishealthyandpotentialharmtothepatientifaconditionisleftundiagnosedanduntreated.Thedetectionofemergingdiseaseoutbreaks,suchasH1N1-influenzaorSARS,whichresultinunusualandalarmingtestresultsinaseriesofpatients,isalsoimportantformonitoringthespreadofdiseasesandtakingpreventiveactions.AviationSafety.Sinceaircraftsarehighlycomplexanddynamicsystems,theyarepronetoaccidents—oftenwithdrasticconsequences—duetomechanical,environmentalorhumanfactors.Tomonitortheoccurrenceofsuchanomalies,mostcommercialairplanesareequippedwithalargenumberofsensorstomeasuredifferentflightparameters,suchasinformationfromthecontrolsystem,theavionicsandpropulsionsystems,andpilotactions.Identifyingabnormaleventsinthesesensorrecordings(e.g.,ananomaloussequenceofpilotactionsoranabnormallyfunctioningaircraftcomponent)canhelppreventaircraftaccidentsandpromoteaviationsafety.

Althoughmuchoftherecentinterestinanomalydetectionisdrivenbyapplicationsinwhichanomaliesarethefocus,historically,anomalydetection(and

removal)hasbeenviewedasadatapreprocessingtechniquetoeliminateerroneousdataobjectsthatmayberecordedbecauseofhumanerror,aproblemwiththemeasuringdevice,orthepresenceofnoise.Suchanomaliesprovidenointerestinginformationbutonlydistorttheanalysisofnormalobjects.Theidentificationandremovalofsucherroneousdataobjectsisnotthefocusofthischapter.Instead,theemphasisisondetectinganomalousobjectsthatareinterestingintheirownright.

9.1CharacteristicsofAnomalyDetectionProblemsAnomalydetectionproblemsarequitediverseinnatureastheyappearinmultipleapplicationdomainsunderdifferentsettings.Thisdiversityinproblemcharacteristicshasresultedinarichvarietyofanomalydetectionmethodsthatareusefulindifferentsituations.Beforewediscussthesemethods,itwillbeusefultodescribesomeofthekeycharacteristicsofanomalydetectionproblemsthatmotivatethedifferentstylesofanomalydetectionmethods.

9.1.1ADefinitionofanAnomaly

Animportantcharacteristicofananomalydetectionproblemisthewayananomalyisdefined.Sinceanomaliesarerareoccurrencesthatarenotfullyunderstood,theycanbedefinedindifferentwaysdependingontheproblemrequirements.However,thefollowinghigh-leveldefinitionofananomalyencompassesmostofthedefinitionscommonlyemployed.

Definition9.1.Ananomalyisanobservationthatdoesn’tfitthedistributionofthedatafornormalinstances,i.e.,isunlikelyunderthedistributionofthemajorityofinstances.

Wenotethefollowingpoints:

Thisdefinitiondoesnotassumethatthedistributioniseasytoexpressintermsofwell-knownstatisticaldistributions.Indeed,thedifficultyofdoingsoisthereasonthatmanyanomalydetectionapproachesusenon-statisticalapproaches.Nonetheless,theseapproachesaimtofinddataobjectsthatarenotcommon.Conceptually,wecanrankdataobjectsaccordingtotheprobabilityofseeingsuchanobjectorsomethingmoreextreme.Thelowertheprobability,themorelikelytheobjectisananomaly.Often,thereciprocaloftheprobabilityisusedasarankingscore.Again,thisisonlypracticalinsomecases.SuchapproachesarediscussedinSection9.3 .Therecanbevariouscausesofananomaly:noise,theobjectcomesfromanotherdistribution,e.g.,afewgrapefruitmixedwithoranges,ortheobjectisjustarareoccurrenceofdatafromthedistribution,e.g.,a7foottallperson.Asmentioned,wearenotinterestedinanomaliesduetonoise.

9.1.2NatureofData

Thenatureoftheinputdataplaysakeyroleindecidingthechoiceofasuitableanomalydetectiontechnique.Someofthecommoncharacteristicsoftheinputdataincludethenumberandtypesofattributes,andtherepresentationusedfordescribingeverydatainstance.

UnivariateorMultivariate

Ifthedatacontainsasingleattribute,thequestionofwhetheranobjectisanomalousdependsonwhethertheobject’svalueforthatattributeis

anomalous.However,ifthedataobjectsarerepresentedusingmanyattributes,adataobjectmayhaveanomalousvaluesforsomeattributesbutordinaryvaluesforotherattributes.Furthermore,anobjectmaybeanomalousevenifnoneofitsattributevaluesareindividuallyanomalous.Forexample,itiscommontohavepeoplewhoaretwofeettall(children)orare100poundsinweight,butuncommontohaveatwo-foottallpersonwhoweighs100pounds.Identifyingananomalyinamultivariatesettingisthuschallenging,particularlywhenthedimensionalityofthedataishigh.

RecordDataorProximityMatrix

Themostcommonapproachforrepresentingadatasetistouserecorddataoritsvariants,e.g.,adatamatrix,whereeverydatainstanceisdescribedusingthesamesetofattributes.However,forthepurposeofanomalydetection,itisoftensufficienttoknowhowdifferentaninstanceisincomparisontootherinstances.Hence,someanomalydetectionmethodsworkwithadifferentrepresentationoftheinputdataknownasaproximitymatrix,whereeveryentryinthematrixdenotesthepairwiseproximity(similarityordissimilarity)betweentwoinstances.Notethatadatamatrixcanalwaysbeconvertedtoaproximitymatrixbyusinganappropriateproximitymeasure.Also,asimilaritymatrixcanbeeasilyconvertedtoadistancematrixusinganyofthetransformationspresentedinSection2.4.1 .

AvailabilityofLabels

Thelabelofadatainstancedenoteswhethertheinstanceisnormaloranomalous.Ifwehaveatrainingsetwithlabelsforeverydatainstance,thentheproblemofanomalydetectiontranslatestoasupervisedlearning(classification)problem.Classificationtechniquesthataddresstheso-calledrareclassproblemareparticularlyrelevantbecauseanomaliesarerelativelyrarewithrespecttonormalobjects.SeeSection4.11 .

However,inmostpracticalapplications,wedonothaveatrainingsetwithaccurateandrepresentativelabelsofthenormalandanomalousclasses.Notethatobtaininglabelsoftheanomalousclassisespeciallychallengingbecauseoftheirrarity.Itisthusdifficultforahumanexperttocatalogeverytypeofanomalysincethepropertiesoftheanomalousclassareoftenunknown.Hence,mostanomalydetectionproblemsareunsupervisedinnature,i.e.,theinputdatadoesnothaveanylabels.Allanomalydetectionmethodspresentedinthischapteroperateintheunsupervisedsetting.

Notethatintheabsenceoflabels,itischallengingtodifferentiateanomaliesfromnormalinstancesgivenaninputdataset.However,anomaliestypicallyhavesomepropertiesthattechniquescantakeadvantageoftomakefindinganomaliespractical.Twokeypropertiesarethefollowing:

RelativelySmallinNumber

Sinceanomaliesareinfrequent,mostinputdatasetshaveapredominanceofnormalinstances.Theinputdatasetisthusoftenusedasanimperfectrepresentationofthenormalclassinmostanomalydetectiontechniques.However,theperformanceofsuchmethodsneedstoberobusttothepresenceofoutliersintheinputdata.Someanomalydetectionmethodsalsoprovideamechanismtospecifytheexpectednumberofoutliersintheinputdata.Suchmethodscanworkwithalargernumberofanomaliesinthedata.

SparselyDistributed

Anomalies,unlikenormalobjects,areoftenunrelatedtoeachotherandhencedistributedsparselyinthespaceofattributes.Indeed,thesuccessfuloperationofmostanomalydetectionmethodsdependsonanomaliesnotbeingtightlyclustered.However,someanomalydetectionmethodsare

specificallydesignedtofindclusteredanomalies(seeSection9.5.1 ),whichareassumedtoeitherbesmallinsizeordistantfromotherinstances.

9.1.3HowAnomalyDetectionisUsed

Therearetwodifferentwaysinwhichanygenericanomalydetectionmethodcanbeused.Inthefirstapproach,wearegivenaninputdatathatcontainsbothnormalandanomalousinstances,andarerequiredtoidentifyanomaliesinthisinputdata.Allanomalydetectionapproachespresentedinthischapterareabletooperateinthissetup.Inthesecondapproach,wearealsoprovidedwithtestinstances(appearingoneatatime)thatneedtobeidentifiedasanomalies.Mostanomalydetectionmethods(withafewexceptions)areabletousetheinputdatasettoprovideoutputsonnewtestinstances.Findinganomaliesbyfindinganomalousclusters—Section9.5.1—isoneoftheexceptions.

9.2CharacteristicsofAnomalyDetectionMethodsTocatertothediverseneedsofanomalydetectionproblems,anumberoftechniqueshavebeenexploredusingconceptsfromdifferentresearchdisciplines.Inthissection,weprovideahigh-leveldescriptionofsomeofthecommoncharacteristicsofanomalydetectionmethodsthatarehelpfulinunderstandingtheircommonalitiesanddifferences.

Model-basedvs.Model-freeManyapproachesforanomalydetectionusetheinputdatatobuildmodelsthatcanbeusedtoidentifywhetheratestinstanceisanomalousornot.Mostmodel-basedtechniquesforanomalydetectionbuildamodelofthenormalclassandidentifyanomaliesthatdonotfitthismodel.Forexample,wecanfitaGaussiandistributiontomodelthenormalclassandthenidentifyanomaliesthatdonotconformwelltothelearneddistribution.Theothertypeofmodel-basedtechniqueslearnsamodelofboththenormalandanomalousclasses,andidentifiesinstancesasanomaliesiftheyaremorelikelytobelongtotheanomalousclass.Althoughtheseapproachestechnicallyrequirerepresentativelabelsfrombothclasses,theyoftenmakeassumptionsaboutthenatureoftheanomalousclass,e.g.,thatanomaliesarerareandsparselydistributed,andthuscanworkeveninanunsupervisedsetting.

In addition to identifying anomalies, model-based methods provide information about the nature of the normal class and sometimes even the anomalous class. However, the assumptions they make about the properties of normal and

anomalousclassesmaynotholdtrueineveryproblem.Incontrast,model-freeapproachesdonotexplicitlycharacterizethedistributionofthenormaloranomalousclasses.Instead,theydirectlyidentifyinstancesasanomalieswithoutlearningmodelsfromtheinputdata.Forexample,aninstancecanbeidentifiedasananomalyifitisquitedifferentfromotherinstancesinitsneighborhood.Model-freeapproachesareoftenintuitiveandsimpletouse.

Globalvs.LocalPerspectiveAninstancecanbeidentifiedasananomalyeitherbyconsideringtheglobalcontext,e.g.,bybuildingamodeloverallnormalinstancesandusingthisglobalmodelforanomalydetection,orbyconsideringthelocalperspectiveofeverydatainstance.Specifically,ananomalydetectionapproachistermedlocalifitsoutputonagiveninstancedoesnotchangeifinstancesoutsideitslocalneighborhoodaremodifiedorremoved.Thedifferencebetweentheglobalandlocalperspectivecanresultinsignificantdifferencesintheresultsofananomalydetectionmethod,becauseanobjectmayseemunusualwithrespecttoallobjectsglobally,butnotwithrespecttoobjectsinitslocalneighborhood.Forexample,apersonwhoseheightis6feet5inchesisunusuallytallwithrespecttothegeneralpopulation,butnotwithrespecttoprofessionalbasketballplayers.

Labelvs.ScoreDifferentapproachesforanomalydetectionproducetheiroutputsindifferentformats.Themostbasictypeofoutputisabinaryanomalylabel:anobjectiseitheridentifiedasananomalyorasanormalinstance.However,labelsdonotprovideanyinformationaboutthedegreetowhichaninstanceisanomalous.Frequently,someofthedetectedanomaliesaremoreextreme

thanothers,whilesomeinstanceslabeledasnormalmaybeonthevergeofbeingidentifiedasanomalies.

Hence,manyanomalydetectionmethodsproduceananomalyscorethatindicateshowstronglyaninstanceislikelytobeananomaly.Ananomalyscorecaneasilybesortedandconvertedintoranks,sothatananalystcanbeprovidedwithonlythetop-mostscoringanomalies.Alternatively,acutoffthresholdcanbeappliedtoananomalyscoretoobtainbinaryanomalylabels.Thetaskofchoosingtherightthresholdisoftenlefttothediscretionoftheanalyst.However,sometimesthescoreshaveanassociatedmeaning,e.g.,statisticalsignificance(seeSection9.3 ),whichmakestheanalysisofanomalieseasierandmoreinterpretable.

Inthefollowingsections,weprovidebriefdescriptionsofsixtypesofanomalydetectionapproaches.Foreachtype,wewilldescribetheirbasicidea,keyfeatures,andunderlyingassumptionsusingillustrativeexamples.Attheendofeverysection,wealsodiscusstheirstrengthsandweaknessinhandlingdifferentaspectsofanomalydetectionproblems.Tofollowcommonpractice,wewillusethetermsoutlierandanomalyinterchangeablyintheremainderofthischapter.

9.3StatisticalApproachesStatisticalapproachesmakeuseofprobabilitydistributions(e.g.,theGaussiandistribution)tomodelthenormalclass.Akeyfeatureofsuchdistributionsisthattheyassociateaprobabilityvaluetoeverydatainstance,indicatinghowlikelyitisfortheinstancetobegeneratedfromthedistribution.Anomaliesarethenidentifiedasinstancesthatareunlikelytobegeneratedfromtheprobabilitydistributionofthenormalclass.

Therearetwotypesofmodelsthatcanbeusedtorepresenttheprobabilitydistributionofthenormalclass:parametricmodelsandnon-parametricmodels.Whileparametricmodelsusewell-knownfamiliesofstatisticaldistributionsthatrequireestimatingparametersfromthedata,non-parametricmodelsaremoreflexibleandlearnthedistributionofthenormalclassdirectlyfromtheavailabledata.Inthefollowing,wediscussbothofthesetypesofmodelsforanomalydetection.

9.3.1UsingParametricModels

Someofthecommontypesofparametricmodelsthatarewidelyusedfordescribingmanytypesofdatasets,includetheGaussiandistribution,thePoissondistribution,andthebinomialdistribution.Theyinvolveparametersthatneedtobelearnedfromthedata,e.g.,aGaussiandistributionrequiresidentifyingthemeanandvarianceparametersfromthedata.

Parametricmodelsarequiteeffectiveinrepresentingthebehaviorofthenormalclass,especiallywhenthenormalclassisknowntofollowaspecific

distribution.Theanomalyscorescomputedbyparametricmodelsalsohavestrongtheoreticalproperties,whichcanbeusedforanalyzingtheanomalyscoresandassessingtheirstatisticalsignificance.Inthefollowing,wediscusstheuseoftheGaussiandistributionformodelingthenormalclass,intheunivariateandmultivariatesettings.

Using the Univariate Gaussian Distribution  The Gaussian (normal) distribution is one of the most frequently used distributions in statistics, and we will use it to describe a simple approach to statistical outlier detection. The Gaussian distribution has two parameters, μ and σ, which are the mean and standard deviation, respectively, and is represented using the notation N(μ, σ). The probability density function f(x) of a point x under the Gaussian distribution is given as

f(x) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)).    (9.1)

Figure 9.1 shows the probability density function of N(0, 1). We can see that p(x) declines as x moves farther from the center of the distribution. We can thus use the distance of a point x from the origin as an anomaly score. As we will see later in Section 9.3.4, this distance value has an interpretation in terms of probability that can be used to assess the confidence in calling x an outlier.

Figure 9.1. Probability density function of a Gaussian distribution with a mean of 0 and a standard deviation of 1.

If the attribute of interest x follows a Gaussian distribution with mean μ and standard deviation σ, i.e., N(μ, σ), a common approach is to transform the attribute x to a new attribute z, which has a N(0, 1) distribution. This can be done by using z = (x − μ)/σ, which is called the z-score. Note that z² is directly related to the probability density of the point x in Equation 9.1 since that equation can be rewritten as follows:

p(x) = (1/√(2πσ²)) e^(−z²/2).    (9.2)

The parameters μ and σ of the Gaussian distribution can be estimated from the training data of mostly normal instances, by using the sample mean x̄ as μ and the sample standard deviation s_x as σ. However, if we believe the outliers are distorting the estimates of these parameters too much, more robust estimates of these quantities can be used—see Bibliographic Notes.
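A minimal sketch of this univariate scheme in Python (illustrative names only): estimate μ and σ from the data and use the absolute z-score of each value as its anomaly score.

import numpy as np

def zscore_anomaly_scores(x):
    # Estimate the Gaussian parameters from the (mostly normal) data.
    mu, sigma = x.mean(), x.std(ddof=1)
    # |z| grows as a value moves away from the center, so it can serve
    # directly as an anomaly score; a cutoff (e.g., |z| > 3) gives labels.
    return np.abs((x - mu) / sigma)

x = np.concatenate([np.random.normal(0, 1, 1000), [6.0, -5.5]])
scores = zscore_anomaly_scores(x)
print(np.argsort(scores)[-2:])   # indices of the two most anomalous values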

Using the Multivariate Gaussian Distribution  For a data set comprised of more than one continuous attribute, we can use a multivariate Gaussian distribution to model the normal class. A multivariate Gaussian distribution N(μ, Σ) involves two parameters, the mean vector μ and the covariance matrix Σ, which need to be estimated from the data. The probability density of a point x distributed as N(μ, Σ) is given by

f(x) = (1 / ((2π)^(p/2) |Σ|^(1/2))) e^(−(x−μ) Σ⁻¹ (x−μ)ᵀ / 2),    (9.3)

where p is the number of dimensions of x and |Σ| denotes the determinant of the covariance matrix Σ.

In the case of a multivariate Gaussian distribution, the distance of a point x from the center μ cannot be directly used as a viable anomaly score. This is because a multivariate normal distribution is not symmetrical with respect to its center if there are correlations between the attributes. To illustrate this, Figure 9.2 shows the probability density of a two-dimensional multivariate Gaussian distribution with mean of (0, 0) and a covariance matrix of

Σ = ( 1.00  0.75
      0.75  3.00 ).

Figure 9.2. Probability density of points for the Gaussian distribution used to generate the points of Figure 9.3.

Figure 9.3. Mahalanobis distance of points from the center of a two-dimensional set of 2002 points.

The probability density varies asymmetrically as we move outward from the center in different directions. To account for this fact, we need a distance measure that takes the shape of the data into consideration. The Mahalanobis distance is one such distance measure. (See Equation 2.27 on page 96.) The Mahalanobis distance between a point x and the mean of the data x̄ is given by

mahalanobis(x, x̄) = (x − x̄) S⁻¹ (x − x̄)ᵀ,    (9.4)

where S is the estimated covariance matrix of the data. Note that the Mahalanobis distance between x and x̄ is directly related to the probability density of x in Equation 9.3, when x̄ and S are used as estimates of μ and Σ, respectively. (See Exercise 9 on page 751.)

Example 9.1 (Outliers in a Multivariate Normal Distribution). Figure 9.3 shows the Mahalanobis distance (from the mean of the distribution) for points in a two-dimensional data set. The points A (−4, 4) and B (5, 5) are outliers that were added to the data set, and their Mahalanobis distance is indicated in the figure. The other 2000 points of the data set were randomly generated using the distribution used for Figure 9.2.

Both A and B have large Mahalanobis distances. However, even though A is closer to the center (the large black x at (0, 0)) as measured by Euclidean distance, it is farther away than B in terms of the Mahalanobis distance because the Mahalanobis distance takes the shape of the distribution into account. In particular, point B has a Euclidean distance of 5√2 and a Mahalanobis distance of 24, while point A has a Euclidean distance of 4√2 and a Mahalanobis distance of 35.

The above approaches assume that the normal class is generated from a single Gaussian distribution. Note that this may not always be the case, especially if there are multiple types of normal classes that have different means and variances. In such cases, we can use a Gaussian mixture model (as described in Section 8.2.2) to represent the normal class. For each point, the smallest Mahalanobis distance of the point to any of the Gaussian distributions is computed and used as the anomaly score. This approach is related to the clustering-based approaches for anomaly detection, which will be described in Section 9.5.
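A small NumPy sketch of the multivariate case follows, computing the Mahalanobis distance of Equation 9.4 from the sample mean and covariance; the generated data mimics the covariance matrix used for Figure 9.2, but the exact numbers are illustrative.

```python
import numpy as np

def mahalanobis_scores(data):
    """Anomaly score of each row of `data` as its Mahalanobis distance from the sample mean."""
    xbar = data.mean(axis=0)              # sample mean vector
    S = np.cov(data, rowvar=False)        # estimated covariance matrix
    S_inv = np.linalg.inv(S)
    diffs = data - xbar
    # (x - xbar) S^{-1} (x - xbar)^T for every row, as in Equation 9.4
    return np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)

# Illustrative data: points from the correlated Gaussian of Figure 9.2, plus two outliers like A and B
rng = np.random.default_rng(0)
cov = np.array([[1.00, 0.75], [0.75, 3.00]])
normal = rng.multivariate_normal(mean=[0, 0], cov=cov, size=2000)
data = np.vstack([normal, [[-4, 4], [5, 5]]])
scores = mahalanobis_scores(data)
print(scores[-2:])  # the two appended outliers receive among the largest scores
```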

9.3.2 Using Non-parametric Models

An alternative for modeling the distribution of the normal class is to use kernel density estimation-based techniques that employ kernel functions (described in Section 8.3.3) to approximate the density of the normal class from the available data. This results in the construction of a non-parametric probability distribution of the normal class, such that regions with a dense occurrence of normal instances have high probability and vice-versa. Note that kernel-based approaches do not assume that the data conforms to any known family of distributions but instead derive the distribution purely from the data. Having learned a probability density for the normal class using the kernel density approach, the anomaly score of an instance is computed as the inverse of its probability with respect to the learned density.

A simpler non-parametric approach to modeling the normal class is to build a histogram of the normal data. For example, if the data contains a single continuous attribute, then we can construct bins for different ranges of the attribute, using the equal-width discretization technique described in Section 2.3.6. We can then check if a new test instance falls in any of the bins of the histogram. If it does not fall in any of the bins, we can identify it as an anomaly. Otherwise, we can use the inverse of the height (frequency) of the bin in which it falls as its anomaly score. This approach is known as the frequency-based or counting-based approach for anomaly detection.

A key step in using frequency-based approaches for anomaly detection is choosing the size of the bin used for constructing the histogram. A small bin size can falsely identify many normal instances as anomalous, since they might fall in empty or sparsely populated bins. On the other hand, if the bin size is too large, many anomalous instances may fall in heavily populated bins and go unnoticed. Thus, choosing the right bin size is challenging, and often requires trying multiple size options or using expert knowledge.
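A rough sketch of the frequency-based approach is shown below, assuming a single continuous attribute and equal-width bins; the bin count and data are illustrative, and treating values that fall outside every bin as anomalies with an infinite score is one possible design choice.

```python
import numpy as np

def histogram_anomaly_scores(train, test, num_bins=20):
    """Frequency-based scores: inverse of the count of the training bin a test value falls in.
    Values outside every bin (or in an empty bin) receive an infinite score, i.e., are flagged."""
    counts, edges = np.histogram(train, bins=num_bins)      # equal-width bins built from normal data
    idx = np.searchsorted(edges, test, side='right') - 1    # bin index of each test value
    scores = np.full(len(test), np.inf)
    inside = (idx >= 0) & (idx < num_bins)
    in_counts = counts[idx[inside]]
    scores[inside] = np.where(in_counts > 0, 1.0 / np.maximum(in_counts, 1), np.inf)
    return scores

# Illustrative data
rng = np.random.default_rng(1)
train = rng.normal(0, 1, 500)
test = np.array([0.0, 3.0, 10.0])
print(histogram_anomaly_scores(train, test))  # 10.0 lies outside all bins and gets an infinite score
```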

9.3.3 Modeling Normal and Anomalous Classes

The statistical approaches described so far only model the distribution of the normal class but not the anomalous class. They assume that the training set predominantly has normal instances. However, if there are outliers present in the training data, which is common in most practical applications, the learning of the probability distributions corresponding to the normal class may be distorted, resulting in poor identification of anomalies.

Here, we present a statistical approach for anomaly detection that can tolerate a considerable fraction (λ) of outliers in the training set, provided that the outliers are uniformly distributed (and thus not clustered) in the attribute space. This approach makes use of a mixture modeling technique to learn the distribution of the normal and anomalous classes. It is similar to the Expectation-Maximization (EM) based technique introduced in the context of clustering in Section 8.2.2. Note that λ, the fraction of outliers, is like a prior.

The basic idea of this approach is to assume that instances are generated with probability λ from the anomalous class, which has a uniform distribution, p_A, and with probability 1 − λ from the normal class, which has the distribution f_M(θ), where θ represents the parameters of the distribution. The approach for assigning training instances to the normal and anomaly classes can be described as follows. Initially, all the objects are assigned to the normal class and the set of anomalous objects is empty. At every iteration of the EM algorithm, objects are transferred from the normal class to the anomaly class to improve the likelihood of the overall data. Let M_t and A_t be the sets of normal and anomalous objects, respectively, at iteration t. The likelihood of the data set D, \mathcal{L}_t(D), and its log-likelihood, \log \mathcal{L}_t(D), are then given by the following equations:

\mathcal{L}_t(D) = \prod_{x_i \in D} P(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_M(x_i, \theta_t) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_A(x_i) \right)    (9.5)

\log \mathcal{L}_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_M(x_i, \theta_t) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_A(x_i)    (9.6)

where |M_t| and |A_t| are the numbers of objects in the normal and anomaly classes, respectively, and θ_t represents the parameters of the distribution of the normal class, which can be estimated using M_t. If the transfer of an object x from M_t to A_t results in a significant increase in the log-likelihood of the data (greater than a threshold c), then x is assigned to the set of outliers. The set of outliers A_t keeps growing till we achieve the maximum likelihood of the data using M_t and A_t. This approach is summarized in Algorithm 9.1.

Algorithm 9.1 Likelihood-based outlier detection.
1: Initialization: At time t = 0, let M_t contain all the objects, while A_t is empty.
2: for each object x that belongs to M_t do
3:   Move x from M_t to A_t to produce the new data sets M_{t+1} and A_{t+1}.
4:   Compute the new log-likelihood of D, \log \mathcal{L}_{t+1}(D).
5:   Compute the difference, Δ = \log \mathcal{L}_{t+1}(D) − \log \mathcal{L}_t(D).
6:   if Δ > c, where c is some threshold then
7:     Classify x as an anomaly.
8:     Increment t by one and use M_{t+1} and A_{t+1} in the next iteration.
9:   end if
10: end for

Because the number of normal objects is large compared to the number of anomalies, the distribution of the normal objects may not change much when an object is moved to the set of anomalies. In that case, the contribution of each normal object to the overall likelihood of the normal objects will remain relatively constant. Furthermore, each object moved to the set of anomalies contributes a fixed amount to the likelihood of the anomalies. Thus, the overall change in the total likelihood of the data when an object is moved to the set of anomalies is roughly equal to the probability of the object under a uniform distribution (weighted by λ) minus the probability of the object under the distribution of the normal data objects (weighted by 1 − λ). Consequently, the set of anomalies will tend to consist of those objects that have significantly higher probability under a uniform distribution than under the distribution of the normal objects.

In the situation just discussed, the approach described by Algorithm 9.1 is roughly equivalent to classifying objects with a low probability under the distribution of normal objects as outliers. For example, when applied to the points in Figure 9.3, this technique would classify points A and B (and other points far from the mean) as outliers. However, if the distribution of the normal objects changes significantly as anomalies are removed or the distribution of the anomalies can be modeled in a more sophisticated manner, then the results produced by this approach will be different than the results of simply classifying low-probability objects as outliers. Also, this approach can work even when the distribution of normal objects is multi-modal, e.g., by using a mixture of Gaussian distributions for f_M(θ). Also, conceptually, it should be possible to use this approach with distributions other than Gaussian.
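The following is a highly simplified, univariate sketch of the idea behind Algorithm 9.1, assuming a single Gaussian distribution for the normal class and a uniform density over the observed range for the anomalous class; the threshold c, λ, and the data are illustrative, and the update schedule is a crude approximation of the procedure rather than a full EM implementation.

```python
import numpy as np
from scipy.stats import norm

def likelihood_based_outliers(data, lam=0.05, c=1.0):
    """Sketch of Algorithm 9.1 for 1-D data: move a point to the anomaly set if doing so
    increases the log-likelihood of Equation 9.6 by more than the threshold c."""
    lo, hi = data.min(), data.max()
    p_uniform = 1.0 / (hi - lo)                 # uniform anomaly density over the data range (assumption)
    normal = list(range(len(data)))             # M_t: indices of normal objects (initially everything)
    anomalies = []                              # A_t: indices of anomalous objects (initially empty)

    def log_likelihood(normal_idx, anomaly_idx):
        x_m = data[normal_idx]
        mu, sigma = x_m.mean(), x_m.std(ddof=1)  # theta_t estimated from the current normal set
        ll = len(normal_idx) * np.log(1 - lam) + norm.logpdf(x_m, mu, sigma).sum()
        ll += len(anomaly_idx) * (np.log(lam) + np.log(p_uniform))
        return ll

    for i in list(normal):
        current = log_likelihood(normal, anomalies)
        trial_normal = [j for j in normal if j != i]
        trial_anoms = anomalies + [i]
        if log_likelihood(trial_normal, trial_anoms) - current > c:  # significant gain: keep the move
            normal, anomalies = trial_normal, trial_anoms
    return anomalies

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), [8.0, 9.5]])   # mostly normal, two obvious outliers
print(likelihood_based_outliers(data))                        # the two appended points should be flagged
```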

9.3.4 Assessing Statistical Significance

Statistical approaches provide a way to assign a measure of confidence to the instances detected as anomalies. For example, since the anomaly scores computed by statistical approaches have a probabilistic meaning, we can apply a threshold to these scores with statistical guarantees. Alternatively, it is possible to define statistical tests (also termed discordancy tests) that can identify the statistical significance of an instance being identified as an anomaly by a statistical approach. Many of these discordancy tests are highly specialized and assume a level of statistical knowledge beyond the scope of this text. Thus, we illustrate the basic ideas with a simple example that uses univariate Gaussian distributions, and refer the reader to the Bibliographic Notes for further pointers.

Consider the Gaussian distribution N(0, 1) shown in Figure 9.1. As discussed previously in Section 9.3.1, most of the probability density is centered around zero and there is little probability that an object (value) belonging to N(0, 1) will occur in the tails of the distribution. For instance, there is only a probability of 0.0027 that an object lies beyond the central area between ±3 standard deviations. More generally, if c is a constant and x is the attribute value of an object, then the probability that |x| ≥ c decreases rapidly as c increases. Let α = prob(|x| ≥ c). Table 9.1 shows some sample values for c and the corresponding values for α when the distribution is N(0, 1). Note that a value that is more than 4 standard deviations from the mean is a one-in-ten-thousand occurrence.

Table 9.1. Sample pairs (c, α), α = prob(|x| ≥ c), for a Gaussian distribution with mean 0 and standard deviation 1.

c      α for N(0, 1)
1.00   0.3173
1.50   0.1336
2.00   0.0455
2.50   0.0124
3.00   0.0027
3.50   0.0005
4.00   0.0001

This interpretation of the distance of a point from the center can be used as the basis of a test to assess whether an object is an outlier, using the following definition.

Definition 9.2 (Outlier for a Single Gaussian Attribute). An object with attribute value x from a Gaussian distribution with mean of 0 and standard deviation 1 is an outlier if

|x| ≥ c,    (9.7)

where c is a constant chosen so that P(|x| ≥ c) = α, where P represents probability.

To use this definition it is necessary to specify a value for α. From the viewpoint that unusual values (objects) indicate a value from a different distribution, α indicates the probability that we mistakenly classify a value from the given distribution as an outlier. From the viewpoint that an outlier is a rare value of a N(0, 1) distribution, α specifies the degree of rareness.

More generally, for a Gaussian distribution with mean μ and standard deviation σ, we can first compute the z-score of x and then apply the above test to z. In practice, this works well when μ and σ are estimated from a large population. A more sophisticated statistical procedure (Grubbs' test), which takes into account the distortion of parameter estimates caused by outliers, is explored in Exercise 7 on page 750.

The approach to outlier detection presented here is equivalent to testing data objects for statistical significance and classifying the statistically significant objects as anomalies. This is discussed in more detail in Chapter 10.
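The values in Table 9.1 follow directly from the standard normal distribution, since α = P(|x| ≥ c) = 2(1 − Φ(c)); a small check using SciPy (an external library, not part of the text):

```python
from scipy.stats import norm

# alpha = P(|x| >= c) = 2 * (1 - Phi(c)) for a standard normal variable
for c in [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]:
    alpha = 2 * norm.sf(c)          # sf(c) = 1 - cdf(c), the upper-tail probability
    print(f"c = {c:.2f}  alpha = {alpha:.4f}")
# Reproduces Table 9.1: 0.3173, 0.1336, 0.0455, 0.0124, 0.0027, 0.0005, 0.0001
```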

9.3.5 Strengths and Weaknesses

Statistical approaches to outlier detection have a firm theoretical foundation and build on standard statistical techniques. When there is sufficient knowledge of the data and the type of test that should be applied, these approaches are statistically justifiable and can be very effective. They can also provide confidence intervals associated with the anomaly scores, which can be very helpful in making decisions about test instances, e.g., determining thresholds on the anomaly score.

However, if the wrong model is chosen, then a normal instance can be erroneously identified as an outlier. For example, the data may be modeled as coming from a Gaussian distribution, but may actually come from a distribution that has a higher probability (than the Gaussian distribution) of having values far from the mean. Statistical distributions with this type of behavior are common in practice and are known as heavy-tailed distributions. Also, we note that while there is a wide variety of statistical outlier tests for single attributes, far fewer options are available for multivariate data, and these tests can perform poorly for high-dimensional data.

9.4 Proximity-based Approaches

Proximity-based methods identify anomalies as those instances that are most distant from the other objects. This relies on the assumption that normal instances are related and hence appear close to each other, while anomalies are different from the other instances and hence are relatively far from other instances. Since many of the proximity-based techniques are based on distances, they are also referred to as distance-based outlier detection techniques.

Proximity-based approaches are model-free anomaly detection techniques, since they do not construct an explicit model of the normal class for computing the anomaly score. They make use of the local perspective of every data instance to compute its anomaly score. They are more general than statistical approaches, since it is often easier to determine a meaningful proximity measure for a data set than to determine its statistical distribution. In the following, we present some of the basic proximity-based approaches for defining an anomaly score. Primarily, these techniques differ in the way they analyze the locality of a data instance.

9.4.1 Distance-based Anomaly Score

One of the simplest ways to define a proximity-based anomaly score of a data instance x is to use the distance to its kth nearest neighbor, dist(x, k). If an instance x has many other instances located close to it (characteristic of the normal class), it will have a low value of dist(x, k). On the other hand, an anomalous instance x will be quite distant from its k neighboring instances and would thus have a high value of dist(x, k).

Figure 9.4 shows a set of points in a two-dimensional space that have been shaded according to their distance to the kth nearest neighbor, dist(x, k) (where k = 5). Note that point C has been correctly assigned a high anomaly score, as it is located far away from other instances.

Figure 9.4. Anomaly score based on the distance to the fifth nearest neighbor.

Note that dist(x, k) can be quite sensitive to the value of k. If k is too small, e.g., 1, then a small number of outliers located close to each other can show a low anomaly score. For example, Figure 9.5 shows anomaly scores using k = 1 for a set of normal points and two outliers that are located close to each other (shading reflects anomaly scores). Note that both C and its neighbor have a low anomaly score. If k is too large, then it is possible for all objects in a cluster that has fewer than k objects to become anomalies. For example, Figure 9.6 shows a data set that has a small cluster of size 5 and a larger cluster of size 30. For k = 5, the anomaly score of all points in the smaller cluster is very high.

Figure 9.5. Anomaly score based on the distance to the first nearest neighbor. Nearby outliers have low anomaly scores.

Figure 9.6. Anomaly score based on distance to the fifth nearest neighbor. A small cluster becomes an outlier.

An alternative distance-based anomaly score that is more robust to the choice of k is the average distance to the first k nearest neighbors, avg.dist(x, k). Indeed, avg.dist(x, k) is widely used in a number of applications as a reliable proximity-based anomaly score.
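A brute-force sketch of dist(x, k) and avg.dist(x, k) using pairwise Euclidean distances is shown below; the data and k are illustrative, and for large n a spatial index would typically replace the O(n²) distance matrix.

```python
import numpy as np

def knn_distance_scores(X, k=5):
    """dist(x, k) and avg.dist(x, k) for every row of X, using brute-force pairwise distances."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # n x n pairwise distance matrix
    np.fill_diagonal(D, np.inf)                                # a point is not its own neighbor
    knn = np.sort(D, axis=1)[:, :k]                            # distances to the k nearest neighbors
    return knn[:, -1], knn.mean(axis=1)                        # dist(x, k), avg.dist(x, k)

# Illustrative data: one cluster plus a single distant point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(30, 2)), [[6.0, 6.0]]])
dist_k, avg_dist_k = knn_distance_scores(X, k=5)
print(np.argmax(dist_k), np.argmax(avg_dist_k))                # the appended point has the top score
```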

9.4.2 Density-based Anomaly Score

The density around an instance can be defined as n/V(d), where n is the number of instances within a specified distance d from the instance, and V(d) is the volume of the neighborhood. Since V(d) is constant for a given d, the density around an instance is often represented using the number of instances n within a fixed distance d. This definition is similar to the one used by the DBSCAN clustering algorithm in Section 7.4. From a density-based viewpoint, anomalies are instances that are in regions of low density. Hence, an anomaly will have a smaller number of instances within a distance d than a normal instance.

Similar to the trade-off in choosing the parameter k in distance-based measures, it is challenging to choose the parameter d in density-based measures. If d is too small, then many normal instances can incorrectly show low density values. If d is too large, then many anomalies may have densities that are similar to normal instances.

Note that the distance-based and density-based views of proximity are quite similar to each other. To realize this, consider the k nearest neighbors of a data instance x, whose distance to the kth nearest neighbor is given by dist(x, k). In this approach, dist(x, k) provides a measure of the density around x, using a different value of d for every instance. If dist(x, k) is large, the density around x is small, and vice-versa. Distance-based and density-based anomaly scores thus follow an inverse relationship. This can be used to define the following measures of density that are based on the two distance measures, dist(x, k) and avg.dist(x, k):

density(x, k) = 1/dist(x, k),
avg.density(x, k) = 1/avg.dist(x, k).

9.4.3 Relative Density-based Anomaly Score

The above proximity-based approaches only consider the locality of an individual instance for computing its anomaly score. In scenarios where the data contains regions of varying densities, such methods would not be able to correctly identify anomalies, as the notion of a normal locality would change across regions.

To illustrate this, consider the set of two-dimensional points in Figure 9.7. This figure has one rather loose cluster of points, another dense cluster of points, and two points, C and D, which are quite far from these two clusters. Assigning anomaly scores to points according to dist(x, k) with k = 5 correctly identifies point C to be an anomaly, but shows a low score for point D. In fact, the score for D is much lower than many points that are part of the loose cluster. To correctly identify anomalies in such data sets, we need a notion of density that is relative to the densities of neighboring instances. For example, point D in Figure 9.7 has a higher absolute density than point A, but its density is lower relative to its nearest neighbors.

Figure 9.7. Anomaly score based on the distance to the fifth nearest neighbor, when there are clusters of varying densities.

There are many ways to define the relative density of an instance. For a point x, one approach is to compute the ratio of the average density of its k nearest neighbors, y_1 to y_k, to the density of x, as follows:

\text{relative density}(x, k) = \frac{\sum_{i=1}^{k} density(y_i, k)/k}{density(x, k)}.    (9.8)

The relative density of a point is high when the average density of points in its neighborhood is significantly higher than the density of the point.

Note that by replacing density(x, k) with avg.density(x, k) in the above equation, we can obtain a more robust measure of relative density. The above approach is similar to that used by the Local Outlier Factor (LOF) score, which is a widely used measure for detecting anomalies using relative density. (See Bibliographic Notes.) However, LOF uses a somewhat different definition of density to achieve results that are more robust.

Example 9.2 (Relative Density Anomaly Detection). Figure 9.8 shows the performance of the relative density-based anomaly detection method on the example data set used previously in Figure 9.7. The anomaly score of every point is computed using Equation 9.8 (with k = 10). The shading of every point represents its score, i.e., points with a higher score are darker. We have labeled points A, C, and D, which have the largest anomaly scores. Respectively, these points are the most extreme anomaly, the most extreme point with respect to the compact set of points, and the most extreme point in the loose set of points.

Figure 9.8. Relative density (LOF) outlier scores for two-dimensional points of Figure 9.7.
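A small sketch of the relative density score of Equation 9.8, using density(x, k) = 1/dist(x, k); it is not the exact LOF formulation, and the two-cluster data set only loosely mimics Figure 9.7.

```python
import numpy as np

def relative_density_scores(X, k=10):
    """Relative density of Equation 9.8: average density of a point's k nearest neighbors
    divided by the point's own density, where density(x, k) = 1 / dist(x, k)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn_idx = np.argsort(D, axis=1)[:, :k]          # indices of the k nearest neighbors
    dist_k = np.sort(D, axis=1)[:, k - 1]          # dist(x, k) for every point
    density = 1.0 / dist_k
    return density[nn_idx].mean(axis=1) / density  # high score = locally sparse point

# Illustrative data: two clusters of different densities plus two isolated points
rng = np.random.default_rng(0)
loose = rng.normal([0, 0], 1.5, size=(50, 2))
dense = rng.normal([8, 8], 0.3, size=(50, 2))
X = np.vstack([loose, dense, [[4, 4], [9.5, 9.5]]])
scores = relative_density_scores(X, k=10)
print(np.argsort(scores)[-2:])   # the two appended points should receive the highest scores
```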

9.4.4 Strengths and Weaknesses

Proximity-based approaches are non-parametric in nature and hence are not restricted to any particular form of distribution of the normal and anomalous classes. They have broad applicability over a wide range of anomaly detection problems where a reasonable proximity measure can be defined between instances. They are quite intuitive and visually appealing, since proximity-based anomalies can be interpreted visually when the data can be displayed in two- or three-dimensional scatter plots.

However, the effectiveness of proximity-based methods depends greatly on the choice of the distance measure. Defining distances in high-dimensional spaces can be challenging. In some cases, dimensionality reduction techniques can be used to map the instances into a lower-dimensional feature space. Proximity-based methods can then be applied in the reduced space for detecting anomalies. Another challenge common to all proximity-based methods is their high computational complexity. Given n points, computing the anomaly score for every point requires considering all pairwise distances, resulting in an O(n²) running time. For large data sets this can be too expensive, although specialized algorithms can be used to improve performance in some cases, e.g., with low-dimensional data sets. Choosing the right value of parameters (k or d) in proximity-based methods is also difficult and often requires domain expertise.

9.5 Clustering-based Approaches

Clustering-based methods for anomaly detection use clusters to represent the normal class. This relies on the assumption that normal instances appear close to each other and hence can be grouped into clusters. Anomalies are then identified as instances that do not fit well in the clustering of the normal class, or appear in small clusters that are far apart from the clusters of the normal class. Clustering-based methods can be categorized into two types: methods that consider small clusters as anomalies, and methods that define a point as anomalous if it does not fit the clustering well, typically as measured by distance from a cluster center. We describe both types of clustering-based methods next.

9.5.1 Finding Anomalous Clusters

This approach assumes the presence of clustered anomalies in the data, where the anomalies appear in tight groups of small size. Clustered anomalies appear when the anomalies are being generated from the same anomalous class. For example, a network attack may have a common pattern in its occurrence, possibly because of a common attacker, who appears in similar ways in multiple instances.

Clusters of anomalies are generally small in size, since anomalies are rare in nature. They are also expected to be quite distant from the clusters of the normal class, since anomalies do not conform to normal patterns or behavior. Hence, a basic approach for detecting anomalous clusters is to cluster the overall data and flag clusters that are either too small in size or too far from other clusters.

For instance, if we use a prototype-based method for clustering the overall data, e.g., using K-means, every cluster can be represented by its prototype, e.g., the centroid of the cluster. We can then treat every prototype as a point and straightforwardly identify clusters that are distant from the rest. As another example, if we are using hierarchical techniques such as MIN, MAX, or Group Average (see Section 7.3), then anomalies are often identified as those instances that are in small clusters or remain singletons even after almost all other points have been clustered.

9.5.2 Finding Anomalous Instances

From a clustering perspective, another way of describing an anomaly is as an instance that cannot be explained well by any of the normal clusters. Hence, a basic approach for anomaly detection is to first cluster all the data (comprised mainly of normal instances) and then assess the degree to which every instance belongs to its respective cluster. For example, if we use K-means clustering, the distance of an instance to its cluster centroid represents how strongly it belongs to the cluster. Instances that are quite distant from their respective cluster centroids can thus be identified as anomalies.

Although clustering-based methods for anomaly detection are quite intuitive and simple to use, there are a number of considerations that must be kept in mind while using them, as we discuss in the following.

Assessing the Extent to Which an Object Belongs to a Cluster

For prototype-based clusters, there are several ways to assess the extent to which an instance belongs to a cluster. One method is to measure the distance of an instance from its cluster prototype and consider this as the anomaly score of the instance. However, if the clusters are of differing densities, then we can construct an anomaly score that measures the relative distance of an instance from the cluster prototype with respect to the distances of the other instances in the cluster. Another possibility, provided that the clusters can be accurately modeled in terms of Gaussian distributions, is to use the Mahalanobis distance.

For clustering techniques that have an objective function, we can assign an anomaly score to an instance that reflects the improvement in the objective function when that instance is eliminated from the overall data. However, such an approach is often computationally intensive. For that reason, the distance-based approaches of the previous paragraph are usually preferred.

Example 9.3 (Clustering-Based Example). This example is based on the set of points shown in Figure 9.7. Prototype-based clustering in this example uses the K-means algorithm, and the anomaly score of a point is computed in two ways: (1) by the point's distance from its closest centroid, and (2) by the point's relative distance from its closest centroid, where the relative distance is the ratio of the point's distance from the centroid to the median distance of all points in the cluster from the centroid. The latter approach is used to adjust for the large difference in density between the compact and loose clusters.

The resulting anomaly scores are shown in Figures 9.9 and 9.10. As before, the anomaly score, measured in this case by the distance or relative distance, is indicated by the shading. We use two clusters in each case. The approach based on raw distance has problems with the differing densities of the clusters, e.g., D is not considered an outlier. For the approach based on relative distances, the points that have previously been identified as outliers using LOF (A, C, and D) also show up as anomalies here.

Figure 9.9. Distance of points from closest centroid.

Figure 9.10. Relative distance of points from closest centroid.
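A sketch of the two scores used in Example 9.3 follows, assuming scikit-learn's KMeans is available (an external library, not part of the text); the data set, number of clusters, and use of the median distance are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_anomaly_scores(X, n_clusters=2, random_state=0):
    """Distance and relative distance (distance / median distance within the assigned cluster)
    of every point from its closest K-means centroid, as in Example 9.3."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    labels = km.labels_
    dist = np.linalg.norm(X - km.cluster_centers_[labels], axis=1)    # distance to closest centroid
    medians = np.array([np.median(dist[labels == c]) for c in range(n_clusters)])
    return dist, dist / medians[labels]                               # raw and relative scores

# Illustrative data: a loose cluster, a dense cluster, and an outlier near the dense cluster (like D)
rng = np.random.default_rng(0)
loose = rng.normal([0, 0], 1.5, size=(50, 2))
dense = rng.normal([8, 8], 0.3, size=(50, 2))
X = np.vstack([loose, dense, [[9.5, 9.5]]])
raw, rel = kmeans_anomaly_scores(X)
print(np.argmax(raw), np.argmax(rel))   # the relative score is more likely to flag the last point
```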

Impact of Outliers on the Initial Clustering

Clustering-based schemes are often sensitive to the presence of outliers in the data. Hence, the presence of outliers can degrade the quality of the clusters corresponding to the normal class, since these clusters are discovered by clustering the overall data, which is comprised of normal and anomalous instances. To address this issue, the following approach can be used: instances are clustered, outliers, which are the points farthest from any cluster, are removed, and then the instances are clustered again. This removal can also be applied at every iteration of the K-means algorithm; the K-means– algorithm is an example of such an algorithm. While there is no guarantee that this approach will yield optimal results, it is easy to use.

A more sophisticated approach is to have a special group for instances that do not currently fit well in any cluster. This group represents potential outliers. As the clustering process proceeds, clusters change. Instances that no longer belong strongly to any cluster are added to the set of potential outliers, while instances currently in the outlier group are tested to see if they now strongly belong to a cluster and can be removed from the set of potential outliers. The instances remaining in the set at the end of the clustering are classified as outliers. Again, there is no guarantee of an optimal solution or even that this approach will work better than the simpler one described previously.

The Number of Clusters to Use

Clustering techniques such as K-means do not automatically determine the number of clusters. This is a problem when using clustering-based methods for anomaly detection, since whether an object is considered an anomaly or not may depend on the number of clusters. For instance, a group of 10 objects may be relatively close to one another, but may be included as part of a larger cluster if only a few large clusters are found. In that case, each of the 10 points could be regarded as an anomaly, even though they would have formed a cluster if a large enough number of clusters had been specified.

As with some of the other issues, there is no simple answer to this problem. One strategy is to repeat the analysis for different numbers of clusters. Another approach is to find a large number of small clusters. The idea is that (1) smaller clusters tend to be more cohesive and (2) if an object is an anomaly even when there are a large number of small clusters, then it is likely a true anomaly. The downside is that groups of anomalies may form small clusters and thus escape detection.

9.5.3 Strengths and Weaknesses

Clustering-based techniques can operate in an unsupervised setting as they do not require training data consisting of only normal instances. Along with identifying anomalies, the learned clusters of the normal class help in understanding the nature of the normal data. Some clustering techniques, such as K-means, have linear or near-linear time and space complexity, and thus an anomaly detection technique based on such algorithms can be highly efficient. However, the performance of clustering-based anomaly detection methods is heavily dependent upon the number of clusters used as well as the presence of outliers in the data. As discussed in Chapters 7 and 8, each clustering algorithm is suitable only for a certain type of data; hence the clustering algorithm needs to be chosen carefully to effectively capture the cluster structure in the data.

9.6 Reconstruction-based Approaches

Reconstruction-based techniques rely on the assumption that the normal class resides in a space of lower dimensionality than the original space of attributes. In other words, there are patterns in the distribution of the normal class that can be captured using lower-dimensional representations, e.g., by using dimensionality reduction techniques.

To illustrate this, consider a data set of normal instances, where every instance is represented using p continuous attributes, x_1, ..., x_p. If there is a hidden structure in the normal class, we can expect to approximate this data using fewer than p derived features. One common approach for deriving useful features from a data set is to use principal components analysis (PCA), as described in Section 2.3.3. By applying PCA on the original data, we obtain p principal components, y_1, ..., y_p, where every principal component is a linear combination of the original attributes. Each principal component captures the maximum amount of variation in the original data subject to the constraint that it must be orthogonal to the preceding principal components. Thus, the amount of variation captured decreases for each successive principal component, and hence, it is possible to approximate the original data using the top k principal components, y_1, ..., y_k. Indeed, if there is a hidden structure in the normal class, we can expect to obtain a good approximation using a smaller number of features, k < p.

Once we have derived a smaller set of k features, we can project any new data instance x to its k-dimensional representation y. Moreover, we can also re-project y back to the original space of p attributes, resulting in a reconstruction of x. Let us denote this reconstruction as x̂ and the squared Euclidean distance between x and x̂ as the reconstruction error:

Reconstruction Error(x) = ||x − x̂||².

Since the low-dimensional features are specifically learned to explain most of the variation in the normal data, we can expect the reconstruction error to be low for normal instances. However, the reconstruction error is high for anomalous instances, as they do not conform to the hidden structure of the normal class. The reconstruction error can thus be used as an effective anomaly detection score.

As an illustration of a reconstruction-based approach for anomaly detection, consider a two-dimensional data set of normal instances, shown as circles in Figure 9.11. The black squares are anomalous instances. The solid black line shows the first principal component learned from this data, which corresponds to the direction of maximum variance of the normal instances.

Figure 9.11. Reconstruction of a two-dimensional data set using a single principal component (shown as a solid black line).

We can see that most of the normal instances are centered around this line. This suggests that the first principal component provides a good approximation to the normal class using a lower-dimensional representation. Using this representation, we can project every data instance x to a point on the line. This projection, x̂, serves as a reconstruction of the original instance using a single principal component.

The distance between x and x̂, which corresponds to the reconstruction error of x, is shown as dashed lines in Figure 9.11. We can see that, since the first principal component has been learned to best fit the normal class, the reconstruction errors of the normal instances are quite small in value. However, the reconstruction errors for anomalous instances (shown as squares) are high, since they do not adhere to the structure of the normal class.
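A minimal sketch of PCA-based reconstruction error, assuming scikit-learn's PCA (an external library, not part of the text); the synthetic data (normal points near the line y = x) and the choice of a single component are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_reconstruction_scores(train, test, n_components=1):
    """Reconstruction error ||x - x_hat||^2 after projecting onto the top principal components
    learned from (mostly normal) training data."""
    pca = PCA(n_components=n_components).fit(train)
    reconstructed = pca.inverse_transform(pca.transform(test))   # project down, then back up
    return np.sum((test - reconstructed) ** 2, axis=1)

# Illustrative data: normal points lie close to the line y = x; the anomalies lie far from it
rng = np.random.default_rng(0)
t = rng.normal(0, 2, 200)
normal = np.column_stack([t, t + rng.normal(0, 0.3, 200)])
anomalies = np.array([[3.0, -3.0], [-2.0, 4.0]])
scores = pca_reconstruction_scores(normal, np.vstack([normal[:5], anomalies]))
print(scores)   # the last two values (the anomalies) are much larger than the first five
```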

Although PCA provides a simple approach for capturing low-dimensional representations, it can only derive features that are linear combinations of the original attributes. When the normal class exhibits nonlinear patterns, it is difficult to capture them using PCA. In such scenarios, the use of an artificial neural network known as an autoencoder provides one possible approach for nonlinear dimensionality reduction and reconstruction. As described in Section 4.7, autoencoders are widely used in the context of deep learning to derive complex features from the training data in an unsupervised setting.

An autoencoder (also referred to as an autoassociator or a mirroring network) is a multi-layer neural network where the number of input and output neurons is equal to the number of original attributes. Figure 9.12 shows the general architecture of an autoencoder, which involves two basic steps, encoding and decoding. During encoding, a data instance x is transformed to a low-dimensional representation y, using a number of nonlinear transformations in the encoding layers. Notice that the number of neurons reduces at every encoding layer, so as to learn low-dimensional representations from the original data. The learned representation y is then mapped back to the original space of attributes using the decoding layers, resulting in a reconstruction of x, denoted by x̂. The distance between x and x̂ (the reconstruction error) is then used as a measure of the anomaly score.

Figure 9.12. A basic architecture of the autoencoder.

In order to learn an autoencoder from an input data set comprised primarily of normal instances, we can use the backpropagation techniques introduced in the context of artificial neural networks in Section 4.7. The autoencoder scheme provides a powerful approach for learning complex and nonlinear representations of the normal class. A number of variants of the basic autoencoder scheme described above have also been explored to learn representations in different types of data sets. For example, the denoising autoencoder is able to robustly learn nonlinear representations from the training data, even in the presence of noise. For more details on the different types of autoencoders, see the Bibliographic Notes.

9.6.1 Strengths and Weaknesses

Reconstruction-based techniques provide a generic approach for modeling the normal class that does not require many assumptions about the distribution of normal instances. They are able to learn a rich variety of representations of the normal class by using a broad family of dimensionality reduction techniques. They can also be used in the presence of irrelevant attributes, since an attribute that does not share any relationship with the other attributes is likely to be ignored in the encoding step, as it would not be of much use in reconstructing the normal class. However, since the reconstruction error is computed by measuring the distance between x and x̂ in the original space of attributes, performance can suffer when the number of attributes is large.

9.7 One-class Classification

One-class classification approaches learn a decision boundary in the attribute space that encloses all normal objects on one side of the boundary. Figure 9.13 shows an example of a decision boundary in the one-class setting, where points belonging to one side of the boundary (shaded) belong to the normal class. This is in contrast to the binary classification approaches introduced in Chapters 3 and 4 that learn boundaries to separate objects from two classes.

Figure 9.13. The decision boundary of a one-class classification problem attempts to enclose the normal instances on the same side of the boundary.

One-class classification presents a unique perspective on anomaly detection, where, instead of learning the distribution of the normal class, the focus is on modeling the boundary of the normal class. From an operational standpoint, learning the boundary is indeed what we need to distinguish anomalies from normal objects. In the words of Vladimir Vapnik, "One should solve the [classification] problem directly and never solve a more general problem [such as learning the distribution of the normal class] as an intermediate step."

In this section, we present an SVM-based one-class approach, known as one-class SVM, which only uses training instances from the normal class to learn its decision boundary. Contrast this with a normal SVM (see Section 4.9), which uses training instances from two classes. This involves the use of kernels and a novel "origin trick," described as follows. (See Section 2.4.7 for an introduction to kernel methods.)

9.7.1 Use of Kernels

In order to learn a nonlinear boundary that encloses the normal class, we transform the data to a higher-dimensional space where the normal class can be separated using a linear hyperplane. This can be done by using a function ϕ that maps every data instance x in the original space of attributes to a point ϕ(x) in the transformed high-dimensional space. (The choice of the mapping function will become clear later.) In the transformed space, the training instances can be separated using a linear hyperplane defined by parameters (w, ρ) as follows:

⟨w, ϕ(x)⟩ = ρ,

where ⟨x, y⟩ denotes the inner product between vectors x and y. Ideally, we want a linear hyperplane that places all of the normal instances on one side. Hence, we want (w, ρ) to be such that ⟨w, ϕ(x)⟩ > ρ if x belongs to the normal class, and ⟨w, ϕ(x)⟩ < ρ if x belongs to the anomaly class.

Let {x_1, x_2, ..., x_n} be the set of training instances belonging to the normal class. Similar to the use of kernels in SVMs (see Chapter 4), we define w as a linear combination of the ϕ(x_i)'s:

w = \sum_{i=1}^{n} \alpha_i \, \phi(x_i).

The separating hyperplane can then be described using the α_i's and ρ as follows:

\sum_{i=1}^{n} \alpha_i \, \langle \phi(x_i), \phi(x) \rangle = \rho.

Note that the above equation deals with inner products of ϕ(x) in the transformed space to describe the hyperplane. To compute such inner products, we can make use of kernel functions, κ(x, y) = ⟨ϕ(x), ϕ(y)⟩, introduced in Section 2.4.7. Note that kernel functions are extensively used for learning nonlinear boundaries in binary classification problems, e.g., using kernel SVMs presented in Chapter 4. However, learning nonlinear boundaries in the one-class setting is challenging in the absence of any information about the anomaly class during training. To overcome this challenge, one-class SVM uses the "origin trick" to learn the separating hyperplane, which works best with certain types of kernel functions. This approach can be briefly described as follows.

9.7.2 The Origin Trick

Consider the Gaussian kernel that is commonly used for learning nonlinear boundaries, which can be defined as

\kappa(x, y) = \exp\left( -\frac{||x - y||^2}{2\sigma^2} \right),

where ||·|| denotes the length of a vector and σ is a hyper-parameter. Before we use the Gaussian kernel to learn a separating hyperplane in the one-class setting, it will be worthwhile to first understand what the transformed space of a Gaussian kernel looks like. There are two important properties of the transformed space of Gaussian kernels that are useful for understanding the intuition behind one-class SVMs:

1. Every point is mapped to a hypersphere of unit radius. To realize this, consider the kernel function κ(x, x) of a point x onto itself. Since ||x − x||² = 0,

κ(x, x) = ⟨ϕ(x), ϕ(x)⟩ = ||ϕ(x)||² = 1.

This implies that the length of ϕ(x) is equal to 1, and hence, ϕ(x) resides on a hypersphere of unit radius for all x.

2. Every point is mapped to the same orthant in the transformed space. For any two points x and y, since κ(x, y) = ⟨ϕ(x), ϕ(y)⟩ ≥ 0, the angle between ϕ(x) and ϕ(y) is always smaller than π/2. Hence, the mappings of all points lie in the same orthant (high-dimensional analogue of "quadrant") in the transformed space.

For illustrative purposes, Figure 9.14 shows a schematic visualization of the transformed space of Gaussian kernels, using the above two considerations. The black dots represent the mappings of training instances in the transformed space, which lie on a quarter arc of a circle with unit radius. In this view, the objective of one-class SVM is to learn a linear hyperplane that can separate the black dots from the mappings of anomalous instances, which would also reside on the same quarter arc. There are many possible hyperplanes that can achieve this task, two of which are shown in Figure 9.14 as dashed lines. In order to choose the best hyperplane (shown as a bold line), we make use of the principle of structural risk minimization, discussed in Chapter 4 in the context of SVM. There are three main requirements that we seek in the optimal hyperplane defined by parameters (w, ρ):

Figure 9.14. Illustrating the concept of one-class SVM in the transformed space.

1. The hyperplane should have a large "margin," or a small value of ||w||². Having a large margin ensures that the model is simple and hence less susceptible to the phenomenon of overfitting.

2. The hyperplane should be as distant from the origin as possible. This ensures a tight representation of points on the upper side of the hyperplane (corresponding to the normal class). Notice from Figure 9.14 that the distance of a hyperplane from the origin is essentially ρ/||w||. Hence, maximizing ρ translates to maximizing the distance of the hyperplane from the origin.

3. In the style of "soft-margin" SVMs, if some of the training instances lie on the opposite side of the hyperplane (corresponding to the anomaly class), then the distance of such points from the hyperplane should be minimized. Note that it is important for an anomaly detection algorithm to be robust to a small number of outliers in the training set as that is quite common in real-world problems. An example of an anomalous training instance is shown in Figure 9.14 as the lower-most black dot on the quarter arc. If a training instance x_i lies on the opposite side of the hyperplane (corresponding to the anomaly class), its distance from the hyperplane, as measured by its slack variable ξ_i, should be kept small. If x_i lies on the side corresponding to the normal class, then ξ_i = 0.

The above three requirements provide the foundation of the optimization objective of one-class SVM, which can be formally described as follows:

\min_{w, \rho, \xi} \;\; \frac{1}{2}||w||^2 - \rho + \frac{1}{n\nu} \sum_{i=1}^{n} \xi_i,    (9.9)

\text{subject to } \; \langle w, \phi(x_i) \rangle \geq \rho - \xi_i, \;\; \xi_i \geq 0,

where n is the number of training instances and ν ∈ (0, 1] is a hyper-parameter that maintains a trade-off between reducing the model complexity and improving the coverage of the decision boundary in keeping the training instances on the same side.

Notice the similarity of the above equation to the optimization objective of binary SVM, introduced in Chapter 4. However, a key difference in one-class SVM is that the constraints are only defined for the normal class but not the anomaly class. At first glance, this might seem to be a serious problem, because the hyperplane is held by constraints from one side (corresponding to the normal class) but is unconstrained from the other side. However, with the help of the "origin trick," one-class SVM is able to overcome this insufficiency by maximizing the distance of the hyperplane from the origin. From this perspective, the origin acts as a surrogate second class, and the learned hyperplane attempts to separate the normal class from this second class in a manner similar to the way a binary SVM separates two classes.

Equation 9.9 is an instance of a quadratic programming problem (QPP) with linear inequality constraints, which is similar to the optimization problem of binary SVM. Hence, the optimization procedures discussed in Chapter 4 for learning a binary SVM can be directly applied for solving Equation 9.9. The learned one-class SVM can then be applied on a test instance to identify if it belongs to the normal class or the anomaly class. Further, if a test instance is identified as an anomaly, its distance from the hyperplane can be seen as an estimate of its anomaly score.

The hyper-parameter ν of one-class SVM has a special interpretation. It represents an upper bound on the fraction of training instances that can be tolerated as anomalies while learning the hyperplane. This means that nν represents the maximum number of training instances that can be placed on the other side of the hyperplane (corresponding to the anomaly class). A low value of ν assumes that the training set has a smaller number of outliers, while a high value of ν ensures that the learning of the hyperplane is robust to a large number of outliers during training.

Figure 9.15 shows the learned decision boundary for an example training set of size 200 using ν = 0.1. We can see that the training data consists of mostly normal instances generated from a Gaussian distribution centered at (0, 0). However, there are also some outliers in the input data that do not conform to the distribution of the normal class. With ν = 0.1, the one-class SVM is able to place at most 20 training instances on the other side of the hyperplane (corresponding to the anomaly class). This results in a decision boundary that robustly encloses the majority of normal instances. If we instead use ν = 0.05, we would only have the budget to tolerate at most 10 outliers in the training set, resulting in the decision boundary shown in Figure 9.16(a). We can see that this decision boundary assigns a much larger region to the normal class than is necessary. On the other hand, the decision boundary learned using ν = 0.2 is shown in Figure 9.16(b), which appears to be much more compact as it can tolerate up to 40 outliers in the training data. The choice of ν thus plays a crucial role in the learning of the decision boundary in one-class SVMs.

Figure 9.15. Decision boundary of one-class SVM with ν = 0.1.

Figure 9.16. Decision boundaries of one-class SVM for varying values of ν.
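As a usage sketch, scikit-learn provides an implementation of this formulation as OneClassSVM (an external library, not part of the text); its gamma parameter plays the role of 1/(2σ²) in the Gaussian kernel, and the data and parameter values below are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Illustrative training set in the spirit of Figure 9.15: mostly normal points centered at (0, 0),
# with some uniformly scattered outliers mixed in. nu bounds the fraction of training instances
# allowed on the anomalous side of the hyperplane.
rng = np.random.default_rng(0)
train = np.vstack([rng.normal(0, 1, size=(180, 2)),
                   rng.uniform(-6, 6, size=(20, 2))])
ocsvm = OneClassSVM(kernel='rbf', gamma=0.5, nu=0.1).fit(train)

test = np.array([[0.2, -0.1], [5.0, 5.0]])
print(ocsvm.predict(test))             # +1 = normal side of the boundary, -1 = anomalous side
print(ocsvm.decision_function(test))   # signed distance from the boundary; more negative = more anomalous
```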

9.7.3 Strengths and Weaknesses

One-class SVMs leverage the principle of structural risk minimization in the learning of the decision boundary, which has strong theoretical foundations. They have the ability to strike a balance between the simplicity of the model and the effectiveness of the boundary in enclosing the distribution of the normal class. By using the hyper-parameter ν, they provide a built-in mechanism to avoid outliers in the training data, which are often common in real-world problems. However, as illustrated in Figure 9.16, the choice of ν significantly impacts the properties of the learned decision boundary. Choosing the right value of ν is difficult, since the hyper-parameter selection techniques discussed in Chapter 4 are only applicable in the multiclass setting, where it is possible to define validation error rates. Also, the use of a Gaussian kernel requires a relatively large training size to effectively learn nonlinear decision boundaries in the attribute space. Further, like regular SVM, one-class SVM has a high computational cost. Hence, it is expensive to train, especially when the training set is large.

9.8 Information Theoretic Approaches

These approaches assume that the normal class can be represented using compact representations, also known as codes. Instead of explicitly learning such representations, the focus of information theoretic approaches is to quantify the amount of information required for encoding them. If the normal class shows some structure or pattern, we can expect to encode it using a small number of bits. Anomalies can then be identified as instances that introduce irregularities in the data, which increase the overall information content of the data set. This is an admissible definition of an anomaly in an operational setting, since anomalies are often associated with an element of surprise, as they do not conform to the patterns or behavior of the normal class.

There are a number of approaches for quantifying the information content (also referred to as complexity) of a data set. For example, if the data set contains a categorical variable, we can assess its information content using the entropy measure, described in Section 2.3.6. For data sets with other types of attributes, other measures such as the Kolmogorov complexity can be used. Intuitively, the Kolmogorov complexity measures the complexity of a data set by the size of the smallest computer program (written in a pre-specified language) that can reproduce the original data. A more practical approach is to compress the data using standard compression techniques, and use the size of the resulting compressed file as a measure of the information content of the original data.

A basic information theoretic approach for anomaly detection can be described as follows. Let us denote the information content of a data set D as Info(D). Consider computing the anomaly score of a data instance x in D. If we remove x from D, we can measure the information content of the remaining data as Info(D_x). If x is indeed an anomaly, it would show a high value of

Gain(x) = Info(D) − Info(D_x).

This happens because anomalies are expected to be surprising, and thus, their elimination should result in a substantial reduction in the information content. We can thus use Gain(x) as a measure of the anomaly score.

Typically, the reduction in information content is measured by eliminating a subset of instances (that are deemed anomalous) and not just a single instance. This is because most measures of information content are not sensitive to the elimination of a single instance, e.g., the size of a compressed data file does not change substantially by removing a single data entry. It is thus necessary to identify the smallest subset of instances X that shows the largest value of Gain(X) upon elimination. This is a non-trivial problem requiring exponential time complexity, although approximate solutions with linear time complexity have also been proposed. (See Bibliographic Notes.)

Example 9.4. Given a survey report of the weight and height of a collection of participants, we want to identify those participants that have unusual heights and weights. Both weight and height can be represented as categorical variables that take three values: {low, medium, high}. Table 9.2 shows the data for the weight and height information of 100 participants, which has an entropy of 2.08. We can see that there is a pattern in the height and weight distribution of normal participants, since most participants that have a high value of weight also have a high value of height, and vice-versa. However, there are 5 participants that have a high weight value but a low height value, which is quite unusual. By eliminating these 5 instances, the entropy of the resulting data set becomes 1.89, resulting in a gain of 2.08 − 1.89 = 0.19.

Table 9.2. Survey data of weight and height of 100 participants.

Weight    Height    Frequency
low       low       20
low       medium    15
medium    medium    40
high      high      20
high      low       5
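The entropy and gain of Example 9.4 can be reproduced with a few lines of Python; the helper below treats each (weight, height) combination as a single categorical value.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of categorical values."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# The (weight, height) data of Table 9.2, expanded to one entry per participant
data = ([('low', 'low')] * 20 + [('low', 'medium')] * 15 + [('medium', 'medium')] * 40
        + [('high', 'high')] * 20 + [('high', 'low')] * 5)
without_suspects = [d for d in data if d != ('high', 'low')]
gain = entropy(data) - entropy(without_suspects)
print(round(entropy(data), 2), round(entropy(without_suspects), 2), round(gain, 2))  # 2.08 1.89 0.19
```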

9.8.1 Strengths and Weaknesses

Information theoretic approaches operate in the unsupervised setting, as they do not require a separate training set of normal-only instances. They do not make many assumptions about the structure of the normal class and are generic enough to be applied with data sets of varying types and properties. However, the performance of information theoretic approaches depends heavily on the choice of the measure used for capturing the information content of a data set. The measure should be suitably chosen so that it is sensitive to the elimination of a small number of instances. This is often a challenge, since compression techniques are often robust to small deviations, rendering them useful only when anomalies are large in number. Further, information theoretic approaches suffer from a high computational cost, making them expensive to apply to large data sets.

9.9 Evaluation of Anomaly Detection

When class labels are available to distinguish between anomalies and normal data, the effectiveness of an anomaly detection scheme can be evaluated by using measures of classification performance discussed in Section 4.11. Since the anomalous class is usually much smaller than the normal class, measures such as precision, recall, and false positive rate are more appropriate than accuracy. In particular, the false positive rate, which is often referred to as the false alarm rate, often determines the practicality of the anomaly detection scheme, since too many false alarms render an anomaly detection system useless.

If class labels are not available, then evaluation is challenging. For model-based approaches, the effectiveness of outlier detection can be judged with respect to the improvement in the goodness of fit of the model once anomalies are eliminated. Similarly, for information theoretic approaches, the information gain gives a measure of the effectiveness. For reconstruction-based approaches, the reconstruction error provides a measure that can be used for evaluation.

The evaluation presented in the last paragraph is analogous to the unsupervised evaluation measures for cluster analysis (see Section 7.5), such as the sum of the squared error (SSE) or the silhouette index, which can be computed even when class labels are not present. Such measures were referred to as "internal" measures because they use only information present in the data set. The same is true of the anomaly evaluation measures mentioned in the last paragraph, i.e., they are internal measures. The key point is that the anomalies of interest for a particular application may not be those that an anomaly detection algorithm labels as anomalies, just as the cluster labels produced by a clustering algorithm may not be consistent with the class labels provided externally. In practice, this means that selecting and tuning an anomaly detection approach often relies on feedback from the users of such a system.

A more general way to evaluate the results of anomaly detection is to look at the distribution of the anomaly scores. The techniques that we have discussed assume that only a relatively small fraction of the data consists of anomalies. Thus, the majority of anomaly scores should be relatively low, with a smaller fraction of scores toward the high end. (This assumes that a higher score indicates an instance is more anomalous.) Thus, by looking at the distribution of the scores via a histogram or density plot, we can assess whether the approach we are using generates scores that behave in a reasonable manner. We illustrate with an example.

Example 9.5 (Distribution of Anomaly Scores). Figures 9.17 and 9.18 show the anomaly scores of two clusters of points. Both have 100 points, but the leftmost cluster is less dense. Figure 9.17, which uses the average distance to the kth nearest neighbor (average KNN dist), shows higher anomaly scores for the points in the less dense cluster. In contrast, Figure 9.18, which uses the LOF for its anomaly scoring, shows similar scores between the two clusters.

Figure 9.17. Anomaly score based on average distance to fifth nearest neighbor.

Figure 9.18. Anomaly score based on LOF using five nearest neighbors.

The histograms of the average KNN dist and the LOF score are shown in Figures 9.19 and 9.20, respectively. The histogram of the LOF scores shows most points with similar anomaly scores and a few points with significantly larger values. The histogram of the average KNN dist shows a bimodal distribution.

Figure 9.19. Histogram of anomaly score based on average distance to the fifth nearest neighbor.

Figure 9.20. Histogram of LOF anomaly score using five nearest neighbors.

The key point is that the distribution of anomaly scores should look similar to that of the LOF scores in this example. There may be one or more secondary peaks in the distribution as one moves to the right, but these secondary peaks should only contain a relatively small fraction of the points, and not a large fraction of the points as with the average KNN dist approach.
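A sketch of this kind of inspection follows, assuming scikit-learn and matplotlib (external libraries, not part of the text); the two-cluster data only mimics the flavor of Example 9.5, and the LOF scores are negated so that larger values indicate more anomalous points.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

# Illustrative data: two clusters of different densities, in the spirit of Example 9.5
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),     # less dense cluster
               rng.normal([6, 6], 0.3, size=(100, 2))])    # denser cluster

# Average distance to the 5 nearest neighbors
dists, _ = NearestNeighbors(n_neighbors=6).fit(X).kneighbors(X)
avg_knn_dist = dists[:, 1:].mean(axis=1)                   # drop the zero self-distance

# LOF scores (scikit-learn reports negative LOF, so negate to make larger = more anomalous)
lof = -LocalOutlierFactor(n_neighbors=5).fit(X).negative_outlier_factor_

# Inspect the two score distributions side by side
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(avg_knn_dist, bins=30); axes[0].set_title('Average KNN dist')
axes[1].hist(lof, bins=30); axes[1].set_title('LOF score')
plt.tight_layout(); plt.show()
```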

9.10 Bibliographic Notes

Anomaly detection has a long history, particularly in statistics, where it is known as outlier detection. Relevant books on the topic are those of Aggarwal [623], Barnett and Lewis [627], Hawkins [648], and Rousseeuw and Leroy [683]. The article by Beckman and Cook [629] provides a general overview of how statisticians look at the subject of outlier detection and provides a history of the subject dating back to comments by Bernoulli in 1777. Also see the related articles [630, 649]. Another general article on outlier detection is the one by Barnett [626]. Articles on finding outliers in multivariate data include those by Davies and Gather [639], Gnanadesikan and Kettenring [646], Rocke and Woodruff [681], Rousseeuw and van Zomeren [685], and Scott [690]. Rosner [682] provides a discussion of finding multiple outliers at the same time.

Surveys by Chandola et al. [633] and Hodge and Austin [651] provide extensive coverage of outlier detection methods, as does a recent book on the topic by Aggarwal [623]. Markou and Singh [674, 675] give a two-part review of techniques for novelty detection that covers statistical and neural network techniques, respectively. Pimentel et al. [678] is another review of novelty detection approaches, including many of the methods discussed in this chapter.

Statistical approaches for anomaly detection in the univariate case are well covered by the books in the first paragraph. Shyu et al. [692] use an approach based on principal components and the Mahalanobis distance to produce anomaly scores for multivariate data. An example of the kernel density approach for anomaly detection is given by Schubert et al. [688]. The mixture model outlier approach discussed in Section 9.3.3 is from Eskin [641]. An approach based on the χ² measure is given by Ye and Chen [695]. Outlier detection based on geometric ideas, such as the depth of convex hulls, has been explored in papers by Johnson et al. [654], Liu et al. [673], and Rousseeuw et al. [684].

The notion of a distance-based outlier and the fact that this definition can include many statistical definitions of an outlier was described by Knorr et al. [663–665]. Ramaswamy et al. [680] propose an efficient distance-based outlier detection procedure that gives each object an outlier score based on the distance of its k-nearest neighbor. Efficiency is achieved by partitioning the data using the first phase of BIRCH (Section 8.5.2). Chaudhary et al. [634] use k-d trees to improve the efficiency of outlier detection, while Bay and Schwabacher [628] use randomization and pruning to improve performance.

For relative density-based approaches, the best known technique is the local outlier factor (LOF) (Breunig et al. [631, 632]), which grew out of DBSCAN. Another locally aware anomaly detection algorithm is LOCI by Papadimitriou et al. [677]. A more recent view of the local approach is given by Schubert et al. [689]. Proximities can be viewed as a graph. The connectivity-based outlier factor (COF) by Tang et al. [694] is a graph-based approach to local outlier detection. A survey of graph-based approaches is provided by Akoglu et al. [625].

High dimensionality poses significant problems for distance- and density-based approaches. A discussion of outlier removal in high-dimensional space can be found in the papers by Aggarwal and Yu [624] and Dunagan and Vempala [640]. Zimek et al. provide a survey of anomaly detection approaches for high-dimensional numerical data [696].

Clustering and anomaly detection have a long relationship. In Chapters 7 and 8, we considered techniques, such as BIRCH, CURE, DENCLUE, DBSCAN, and SNN density-based clustering, which specifically include techniques for handling anomalies. Statistical approaches that further discuss this relationship are described in papers by Scott [690] and Hardin and Rocke [647]. The K-means– algorithm, which can simultaneously handle clustering and outliers, was proposed by Chawla and Gionis [637].

Our discussion of reconstruction-based approaches focused on a neural network-based approach, i.e., the autoencoder. More broadly, a discussion of approaches in the area of neural networks can be found in papers by Ghosh and Schwartzbard [645], Sykacek [693], and Hawkins et al. [650], who discuss replicator networks. The one-class SVM approach for anomaly detection was created by Schölkopf et al. [686] and improved by Li et al. [672]. More generally, techniques for one-class classification are surveyed in [662]. The use of information measures in anomaly detection is described by Lee and Xiang [671].

In this chapter, we focused on unsupervised anomaly detection. Supervised anomaly detection falls into the category of rare class classification. Work on rare class detection includes the work of Joshi et al. [655–659]. The rare class problem is also sometimes referred to as the imbalanced data set problem. Of relevance are an AAAI workshop (Japkowicz [653]), an ICML workshop (Chawla et al. [635]), and a special issue of SIGKDD Explorations (Chawla et al. [636]).

Evaluation of unsupervised anomaly detection approaches was discussed in Section 9.9. See also the discussion in Chapter 8 of the book by Aggarwal [623]. In summary, evaluation approaches are quite limited. For supervised anomaly detection, an overview of current approaches for evaluation can be found in Schubert et al. [687].

In this chapter, we have focused on basic anomaly detection schemes. We have not considered schemes that take into account the spatial or temporal nature of the data. Shekhar et al. [691] provide a detailed discussion of the problem of spatial outliers and present a unified approach to spatial outlier detection. A discussion of the challenges for anomaly detection in climate data is provided by Kawale et al. [660].

The issue of outliers in time series was first considered in a statistically rigorous way by Fox [643]. Muirhead [676] provides a discussion of different types of outliers in time series. Abraham and Chuang [622] propose a Bayesian approach to outliers in time series, while Chen and Liu [638] consider different types of outliers in time series and propose a technique to detect them and obtain good estimates of time series parameters. Work on finding deviant or surprising patterns in time series databases has been performed by Jagadish et al. [652] and Keogh et al. [661].

An important application area for anomaly detection is intrusion detection. Surveys of the applications of data mining to intrusion detection are given by Lee and Stolfo [669] and Lazarevic et al. [668]. In a different paper, Lazarevic et al. [667] provide a comparison of anomaly detection routines specific to network intrusion. Garcia et al. [644] provide a recent survey of anomaly detection for network intrusion detection. A framework for using data mining techniques for intrusion detection is provided by Lee et al. [670]. Clustering-based approaches in the area of intrusion detection include work by Eskin et al. [642], Lane and Brodley [666], and Portnoy et al. [679].

Bibliography

[622] B. Abraham and A. Chuang. Outlier Detection and Time Series Modeling. Technometrics, 31(2):241–248, May 1989.

[623] C. C. Aggarwal. Outlier Analysis. Springer Science & Business Media, 2013.

[624] C. C. Aggarwal and P. S. Yu. Outlier Detection for High Dimensional Data. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, SIGMOD '01, pages 37–46, New York, NY, USA, 2001. ACM.

[625] L. Akoglu, H. Tong, and D. Koutra. Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery, 29(3):626–688, 2015.

[626] V. Barnett. The Study of Outliers: Purpose and Model. Applied Statistics, 27(3):242–250, 1978.

[627] V. Barnett and T. Lewis. Outliers in Statistical Data. Wiley Series in Probability and Statistics. John Wiley & Sons, 3rd edition, April 1994.

[628] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proc. of the 9th Intl. Conf. on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[629] R. J. Beckman and R. D. Cook. 'Outlier..........s'. Technometrics, 25(2):119–149, May 1983.

[630] R. J. Beckman and R. D. Cook. ['Outlier..........s']: Response. Technometrics, 25(2):161–163, May 1983.

[631] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. OPTICS-OF: Identifying Local Outliers. In Proceedings of the Third European Conference on Principles of Data Mining and Knowledge Discovery, pages 262–270. Springer-Verlag, 1999.

[632] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proc. of 2000 ACM-SIGMOD Intl. Conf. on Management of Data, pages 93–104. ACM Press, 2000.

[633] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.

[634] A. Chaudhary, A. S. Szalay, and A. W. Moore. Very fast outlier detection in large multidimensional data sets. In Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), 2002.

[635] N. V. Chawla, N. Japkowicz, and A. Kolcz, editors. Workshop on Learning from Imbalanced Data Sets II, 20th Intl. Conf. on Machine Learning, 2000. AAAI Press.

[636] N. V. Chawla, N. Japkowicz, and A. Kolcz, editors. SIGKDD Explorations Newsletter, Special issue on learning from imbalanced data sets, volume 6(1), June 2004. ACM Press.

[637] S. Chawla and A. Gionis. k-means-: A Unified Approach to Clustering and Outlier Detection. In SDM, pages 189–197. SIAM, 2013.

[638] C. Chen and L.-M. Liu. Joint Estimation of Model Parameters and Outlier Effects in Time Series. Journal of the American Statistical Association, 88(421):284–297, March 1993.

[639] L. Davies and U. Gather. The Identification of Multiple Outliers. Journal of the American Statistical Association, 88(423):782–792, September 1993.

[640] J. Dunagan and S. Vempala. Optimal outlier removal in high-dimensional spaces. Journal of Computer and System Sciences, Special Issue on STOC 2001, 68(2):335–373, March 2004.

[641] E. Eskin. Anomaly Detection over Noisy Data using Learned Probability Distributions. In Proc. of the 17th Intl. Conf. on Machine Learning, pages 255–262, 2000.

[642] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. J. Stolfo. A geometric framework for unsupervised anomaly detection. In Applications of Data Mining in Computer Security, pages 78–100. Kluwer Academics, 2002.

[643] A. J. Fox. Outliers in Time Series. Journal of the Royal Statistical Society, Series B (Methodological), 34(3):350–363, 1972.

[644] P. Garcia-Teodoro, J. Diaz-Verdejo, G. Maciá-Fernández, and E. Vázquez. Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers & Security, 28(1):18–28, 2009.

[645] A. Ghosh and A. Schwartzbard. A Study in Using Neural Networks for Anomaly and Misuse Detection. In 8th USENIX Security Symposium, August 1999.

[646] R. Gnanadesikan and J. R. Kettenring. Robust Estimates, Residuals, and Outlier Detection with Multiresponse Data. Biometrics, 28(1):81–124, March 1972.

[647] J. Hardin and D. M. Rocke. Outlier Detection in the Multiple Cluster Setting using the Minimum Covariance Determinant Estimator. Computational Statistics and Data Analysis, 44:625–638, 2004.

[648] D. M. Hawkins. Identification of Outliers. Monographs on Applied Probability and Statistics. Chapman & Hall, May 1980.

[649]D.M.Hawkins.‘[Outlier……….s]’:Discussion.Technometrics,25(2):155–156,May1983.

[650]S.Hawkins,H.He,G.J.Williams,andR.A.Baxter.OutlierDetectionUsingReplicatorNeuralNetworks.InDaWaK2000:Proc.ofthe4thIntnl.Conf.onDataWarehousingandKnowledgeDiscovery,pages170–180.Springer-Verlag,2002.

[651]V.J.HodgeandJ.Austin.ASurveyofOutlierDetectionMethodologies.ArtificialIntelligenceReview,22:85–126,2004.

[652]H.V.Jagadish,N.Koudas,andS.Muthukrishnan.MiningDeviantsinaTimeSeriesDatabase.InProc.ofthe25thVLDBConf.,pages102–113,1999.

[653]N.Japkowicz,editor.WorkshoponLearningfromImbalancedDataSetsI,SeventeenthNationalConferenceonArtificialIntelligence,PublishedasTechnicalReportWS-00-05,2000.AAAIPress.

[654]T.Johnson,I.Kwok,andR.T.Ng.FastComputationof2-DimensionalDepthContours.InKDD98,pages224–228,1998.

[655]M.V.Joshi.OnEvaluatingPerformanceofClassifiersforRareClasses.InProc.ofthe2002IEEEIntl.Conf.onDataMining,pages641–644,2002.

[656]M.V.Joshi,R.C.Agarwal,andV.Kumar.Miningneedleinahaystack:Classifyingrareclassesviatwo-phaseruleinduction.InProc.of2001ACM-SIGMODIntl.Conf.onManagementofData,pages91–102.ACMPress,2001.

[657]M.V.Joshi,R.C.Agarwal,andV.Kumar.Predictingrareclasses:canboostingmakeanyweaklearnerstrong?InProc.of2002ACM-SIGMODIntl.Conf.onManagementofData,pages297–306.ACMPress,2002.

[658]M.V.Joshi,R.C.Agarwal,andV.Kumar.PredictingRareClasses:ComparingTwo-PhaseRuleInductiontoCost-SensitiveBoosting.InProc.ofthe6thEuropeanConf.ofPrinciplesandPracticeofKnowledgeDiscoveryinDatabases,pages237–249.Springer-Verlag,2002.

[659]M.V.Joshi,V.Kumar,andR.C.Agarwal.EvaluatingBoostingAlgorithmstoClassifyRareClasses:ComparisonandImprovements.InProc.ofthe2001IEEEIntl.Conf.onDataMining,pages257–264,2001.

[660]J.Kawale,S.Chatterjee,A.Kumar,S.Liess,M.Steinbach,andV.Kumar.Anomalyconstructioninclimatedata:issuesandchallenges.InNASAConferenceonIntelligentDataUnderstandingCIDU,2011.

[661]E.Keogh,S.Lonardi,andB.Chiu.FindingSurprisingPatternsinaTimeSeriesDatabaseinLinearTimeandSpace.InProc.ofthe8thIntl.Conf.onKnowledgeDiscoveryandDataMining,Edmonton,Alberta,Canada,July2002.

[662]S.S.KhanandM.G.Madden.One-classclassification:taxonomyofstudyandreviewoftechniques.TheKnowledgeEngineeringReview,29(03):345–374,2014.

[663]E.M.KnorrandR.T.Ng.AUnifiedNotionofOutliers:PropertiesandComputation.InProc.ofthe3rdIntl.Conf.onKnowledgeDiscoveryand

DataMining,pages219–222,1997.

[664]E.M.KnorrandR.T.Ng.AlgorithmsforMiningDistance-BasedOutliersinLargeDatasets.InProc.ofthe24thVLDBConf.,pages392–403,August1998.

[665]E.M.Knorr,R.T.Ng,andV.Tucakov.Distance-basedoutliers:algorithmsandapplications.TheVLDBJournal,8(3-4):237–253,2000.

[666]T.LaneandC.E.Brodley.AnApplicationofMachineLearningtoAnomalyDetection.InProc.20thNIST-NCSCNationalInformationSystemsSecurityConf.,pages366–380,1997.

[667]A.Lazarevic,L.Ertöz,V.Kumar,A.Ozgur,andJ.Srivastava.AComparativeStudyofAnomalyDetectionSchemesinNetworkIntrusionDetection.InProc.ofthe2003SIAMIntl.Conf.onDataMining,2003.

[668]A.Lazarevic,V.Kumar,andJ.Srivastava.IntrusionDetection:ASurvey.InManagingCyberThreats:Issues,ApproachesandChallenges,pages19–80.KluwerAcademicPublisher,2005.

[669]W.LeeandS.J.Stolfo.DataMiningApproachesforIntrusionDetection.In7thUSENIXSecuritySymposium,pages26–29,January1998.

[670]W.Lee,S.J.Stolfo,andK.W.Mok.ADataMiningFrameworkforBuildingIntrusionDetectionModels.InIEEESymposiumonSecurityandPrivacy,pages120–132,1999.

[671]W.LeeandD.Xiang.Information-theoreticmeasuresforanomalydetection.InProc.ofthe2001IEEESymposiumonSecurityandPrivacy,pages130–143,May2001.

[672]K.-L.Li,H.-K.Huang,S.-F.Tian,andW.Xu.Improvingone-classSVMforanomalydetection.InMachineLearningandCybernetics,2003InternationalConferenceon,volume5,pages3077–3081.IEEE,2003.

[673]R.Y.Liu,J.M.Parelius,andK.Singh.Multivariateanalysisbydatadepth:descriptivestatistics,graphicsandinference.AnnalsofStatistics,27(3):783–858,1999.

[674]M.MarkouandS.Singh.Noveltydetection:Areview–part1:Statisticalapproaches.SignalProcessing,83(12):2481–2497,2003.

[675]M.MarkouandS.Singh.Noveltydetection:Areview–part2:Neuralnetworkbasedapproaches.SignalProcessing,83(12):2499–2521,2003.

[676]C.R.Muirhead.DistinguishingOutlierTypesinTimeSeries.JournaloftheRoyalStatisticalSociety.SeriesB(Methodological),48(1):39–47,1986.

[677]S.Papadimitriou,H.Kitagawa,P.B.Gibbons,andC.Faloutsos.Loci:Fastoutlierdetectionusingthelocalcorrelationintegral.InDataEngineering,2003.Proceedings.19thInternationalConferenceon,pages315–326.IEEE,2003.

[678]M.A.Pimentel,D.A.Clifton,L.Clifton,andL.Tarassenko.Areviewofnoveltydetection.SignalProcessing,99:215–249,2014.

[679]L.Portnoy,E.Eskin,andS.J.Stolfo.Intrusiondetectionwithunlabeleddatausingclustering.InInACMWorkshoponDataMiningAppliedtoSecurity,2001.

[680]S.Ramaswamy,R.Rastogi,andK.Shim.Efficientalgorithmsforminingoutliersfromlargedatasets.InProc.of2000ACM-SIGMODIntl.Conf.onManagementofData,pages427–438.ACMPress,2000.

[681]D.M.RockeandD.L.Woodruff.IdentificationofOutliersinMultivariateData.JournaloftheAmericanStatisticalAssociation,91(435):1047–1061,September1996.

[682]B.Rosner.OntheDetectionofManyOutliers.Technometrics,17(3):221–227,1975.

[683]P.J.RousseeuwandA.M.Leroy.RobustRegressionandOutlierDetection.WileySeriesinProbabilityandStatistics.JohnWiley&Sons,September2003.

[684]P.J.Rousseeuw,I.Ruts,andJ.W.Tukey.TheBagplot:ABivariateBoxplot.TheAmericanStatistician,53(4):382–387,November1999.

[685]P.J.RousseeuwandB.C.vanZomeren.UnmaskingMultivariateOutliersandLeveragePoints.JournaloftheAmericanStatistical

Association,85(411):633–639,September1990.

[686]B.Schölkopf,R.C.Williamson,A.J.Smola,J.Shawe-Taylor,J.C.Platt,etal.SupportVectorMethodforNoveltyDetection.InNIPS,volume12,pages582–588,1999.

[687]E.Schubert,R.Wojdanowski,A.Zimek,andH.-P.Kriegel.Onevaluationofoutlierrankingsandoutlierscores.InProceedingsofthe2012SIAMInternationalConferenceonDataMining.SIAM,2012.

[688]E.Schubert,A.Zimek,andH.-P.Kriegel.GeneralizedOutlierDetectionwithFlexibleKernelDensityEstimates.InSDM,volume14,pages542–550.SIAM,2014.

[689]E.Schubert,A.Zimek,andH.-P.Kriegel.Localoutlierdetectionreconsidered:ageneralizedviewonlocalitywithapplicationstospatial,video,andnetworkoutlierdetection.DataMiningandKnowledgeDiscovery,28(1):190–237,2014.

[690]D.W.Scott.PartialMixtureEstimationandOutlierDetectioninDataandRegression.InM.Hubert,G.Pison,A.Struyf,andS.V.Aelst,editors,TheoryandApplicationsofRecentRobustMethods,StatisticsforIndustryandTechnology.Birkhauser,2003.

[691]S.Shekhar,C.-T.Lu,andP.Zhang.AUnifiedApproachtoDetectingSpatialOutliers.GeoInformatica,7(2):139–166,June2003.

[692]M.-L.Shyu,S.-C.Chen,K.Sarinnapakorn,andL.Chang.ANovelAnomalyDetectionSchemeBasedonPrincipalComponentClassifier.InProc.ofthe2003IEEEIntl.Conf.onDataMining,pages353–365,2003.

[693]P.Sykacek.Equivalenterrorbarsforneuralnetworkclassifierstrainedbybayesianinference.InProc.oftheEuropeanSymposiumonArtificialNeuralNetworks,pages121–126,1997.

[694]J.Tang,Z.Chen,A.W.-c.Fu,andD.Cheung.Arobustoutlierdetectionschemeforlargedatasets.InIn6thPacific-AsiaConf.onKnowledgeDiscoveryandDataMining.Citeseer,2001.

[695]N.YeandQ.Chen.Chi-squareStatisticalProfilingforAnomalyDetection.InProc.ofthe2000IEEEWorkshoponInformationAssuranceandSecurity,pages187–193,June2000.

[696]A.Zimek,E.Schubert,andH.-P.Kriegel.Asurveyonunsupervisedoutlierdetectioninhigh-dimensionalnumericaldata.StatisticalAnalysisandDataMining,5(5):363–387,2012.

9.11Exercises1.CompareandcontrastthedifferenttechniquesforanomalydetectionthatwerepresentedinSection9.2 .Inparticular,trytoidentifycircumstancesinwhichthedefinitionsofanomaliesusedinthedifferenttechniquesmightbeequivalentorsituationsinwhichonemightmakesense,butanotherwouldnot.Besuretoconsiderdifferenttypesofdata.

2.Considerthefollowingdefinitionofananomaly:Ananomalyisanobjectthatisunusuallyinfluentialinthecreationofadatamodel.

a. Comparethisdefinitiontothatofthestandardmodel-baseddefinitionofananomaly.

b. Forwhatsizesofdatasets(small,medium,orlarge)isthisdefinitionappropriate?

3.Inoneapproachtoanomalydetection,objectsarerepresentedaspointsinamultidimensionalspace,andthepointsaregroupedintosuccessiveshells,whereeachshellrepresentsalayeraroundagroupingofpoints,suchasaconvexhull.Anobjectisananomalyifitliesinoneoftheoutershells.

a. TowhichofthedefinitionsofananomalyinSection9.2 isthisdefinitionmostcloselyrelated?

b. Nametwoproblemswiththisdefinitionofananomaly.

4.Associationanalysiscanbeusedtofindanomaliesasfollows.Findstrongassociationpatterns,whichinvolvesomeminimumnumberofobjects.Anomaliesarethoseobjectsthatdonotbelongtoanysuchpatterns.Tomakethismoreconcrete,wenotethatthehypercliqueassociationpattern

discussedinSection5.8 isparticularlysuitableforsuchanapproach.Specifically,givenauser-selectedh-confidencelevel,maximalhypercliquepatternsofobjectsarefound.Allobjectsthatdonotappearinamaximalhypercliquepatternofatleastsizethreeareclassifiedasoutliers.

a. Doesthistechniquefallintoanyofthecategoriesdiscussedinthischapter?Ifso,whichone?

b. Nameonepotentialstrengthandonepotentialweaknessofthisapproach.

5.Discusstechniquesforcombiningmultipleanomalydetectiontechniquestoimprovetheidentificationofanomalousobjects.Considerbothsupervisedandunsupervisedcases.

6.Describethepotentialtimecomplexityofanomalydetectionapproachesbasedonthefollowingapproaches:model-basedusingclustering,proximity-based,anddensity.Noknowledgeofspecifictechniquesisrequired.Rather,focusonthebasiccomputationalrequirementsofeachapproach,suchasthetimerequiredtocomputethedensityofeachobject.

7. The Grubbs' test, which is described by Algorithm 9.2, is a more statistically sophisticated procedure for detecting outliers than that of Definition 9.2. It is iterative and also takes into account the fact that the z-score does not have a normal distribution. This algorithm computes the z-score of each value based on the sample mean and standard deviation of the current set of values. The value with the largest magnitude z-score is discarded if its z-score is larger than gc, the critical value of the test for an outlier at significance level α. This process is repeated until no objects are eliminated. Note that the sample mean, standard deviation, and gc are updated at each iteration.

a. What is the limit of the value gc used for Grubbs' test as m approaches infinity? Use a significance level of 0.05.

b. Describe, in words, the meaning of the previous result.

Algorithm 9.2 Grubbs' approach for outlier elimination.

1: Input the values and α. {m is the number of values, α is a parameter, and tc is a value chosen so that α = P(x ≥ tc) for a t distribution with m − 2 degrees of freedom.}
2: repeat
3:   Compute the sample mean (x̄) and standard deviation (sx).
4:   Compute a value gc so that P(|z| ≥ gc) = α. (In terms of tc and m, gc = ((m − 1)/√m) √(tc² / (m − 2 + tc²)).)
5:   Compute the z-score of each value, i.e., z = (x − x̄)/sx.
6:   Let g = max |z|, i.e., find the z-score of largest magnitude and call it g.
7:   if g > gc then
8:     Eliminate the value corresponding to g.
9:     m ← m − 1
10:  end if
11: until No objects are eliminated.
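
To make Algorithm 9.2 concrete, the following Python code is a minimal sketch of the iterative elimination loop, not a reference implementation: the function name grubbs_eliminate and the example data are ours, and it assumes NumPy and SciPy (scipy.stats.t) are available for the t-distribution critical value.

```python
# A minimal sketch of Algorithm 9.2 (Grubbs' iterative outlier elimination).
import numpy as np
from scipy import stats

def grubbs_eliminate(values, alpha=0.05):
    """Iteratively remove the value with the largest |z-score| while it
    exceeds the critical value g_c of Grubbs' test at significance level alpha."""
    x = np.asarray(values, dtype=float)
    while len(x) > 2:                          # need m - 2 > 0 degrees of freedom
        m = len(x)
        mean, std = x.mean(), x.std(ddof=1)    # sample mean and standard deviation
        z = (x - mean) / std                   # z-score of each value
        g = np.abs(z).max()                    # largest magnitude z-score, g
        # t_c chosen so that alpha = P(t >= t_c) for a t distribution with m - 2 df
        t_c = stats.t.ppf(1.0 - alpha, df=m - 2)
        # critical value g_c from line 4 of Algorithm 9.2
        g_c = (m - 1) / np.sqrt(m) * np.sqrt(t_c**2 / (m - 2 + t_c**2))
        if g > g_c:
            x = np.delete(x, np.abs(z).argmax())   # eliminate the most extreme value
        else:
            break                               # no objects eliminated: stop
    return x

# Example: a cluster of values near 10 plus one obvious outlier (hypothetical data).
data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 25.0]
print(grubbs_eliminate(data, alpha=0.05))       # the value 25.0 should be removed
```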

8.Manystatisticaltestsforoutliersweredevelopedinanenvironmentinwhichafewhundredobservationswasalargedataset.Weexplorethelimitationsofsuchapproaches.

a. Forasetof1,000,000values,howlikelyarewetohaveoutliersaccordingtothetestthatsaysavalueisanoutlierifitismorethanthreestandarddeviationsfromtheaverage?(Assumeanormaldistribution.)

b. Doestheapproachthatstatesanoutlierisanobjectofunusuallylowprobabilityneedtobeadjustedwhendealingwithlargedatasets?Ifso,how?

9. The probability density of a point x with respect to a multivariate normal distribution having a mean μ and covariance matrix Σ is given by the equation

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{m/2}\,|\mathbf{\Sigma}|^{1/2}}\, e^{-\frac{(\mathbf{x}-\boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})}{2}}. \qquad (9.10)$$

Using the sample mean x̄ and covariance matrix S as estimates of the mean μ and covariance matrix Σ, respectively, show that −log f(x) is equal to the Mahalanobis distance between a data point x and the sample mean x̄ plus a constant that does not depend on x.

10.Comparethefollowingtwomeasuresoftheextenttowhichanobjectbelongstoacluster:(1)distanceofanobjectfromthecentroidofitsclosestclusterand(2)thesilhouettecoefficientdescribedinSection7.5.2 .

11.Considerthe(relativedistance)K-meansschemeforoutlierdetectiondescribedinSection9.5 andtheaccompanyingfigure,Figure9.10 .

a. ThepointsatthebottomofthecompactclustershowninFigure9.10haveasomewhathigheroutlierscorethanthosepointsatthetopofthecompactcluster.Why?


b. Supposethatwechoosethenumberofclusterstobemuchlarger,e.g.,10.Wouldtheproposedtechniquestillbeeffectiveinfindingthemostextremeoutlieratthetopofthefigure?Whyorwhynot?

c. Theuseofrelativedistanceadjustsfordifferencesindensity.Giveanexampleofwheresuchanapproachmightleadtothewrongconclusion.

12. If the probability that a normal object is classified as an anomaly is 0.01 and the probability that an anomalous object is classified as anomalous is 0.99, then what is the false alarm rate and detection rate if 99% of the objects are normal? (Use the definitions given below.)

$$\text{detection rate} = \frac{\text{number of anomalies detected}}{\text{total number of anomalies}} \qquad (9.11)$$

$$\text{false alarm rate} = \frac{\text{number of false anomalies}}{\text{number of objects classified as anomalies}} \qquad (9.12)$$

13.Whenacomprehensivetrainingsetisavailable,asupervisedanomalydetectiontechniquecantypicallyoutperformanunsupervisedanomalytechniquewhenperformanceisevaluatedusingmeasuressuchasthedetectionandfalsealarmrate.However,insomecases,suchasfrauddetection,newtypesofanomaliesarealwaysdeveloping.Performancecanbeevaluatedaccordingtothedetectionandfalsealarmrates,becauseitisusuallypossibletodetermineuponinvestigationwhetheranobject(transaction)isanomalous.Discusstherelativemeritsofsupervisedandunsupervisedanomalydetectionundersuchconditions.

14. Consider a group of documents that has been selected from a much larger set of diverse documents so that the selected documents are as dissimilar from one another as possible. If we consider documents that are not highly related (connected, similar) to one another as being anomalous, then all of the documents that we have selected might be classified as anomalies. Is it possible for a data set to consist only of anomalous objects or is this an abuse of the terminology?

15.Considerasetofpoints,wheremostpointsareinregionsoflowdensity,butafewpointsareinregionsofhighdensity.Ifwedefineananomalyasapointinaregionoflowdensity,thenmostpointswillbeclassifiedasanomalies.Isthisanappropriateuseofthedensity-baseddefinitionofananomalyorshouldthedefinitionbemodifiedinsomeway?

16.Considerasetofpointsthatareuniformlydistributedontheinterval[0,1].Isthestatisticalnotionofanoutlierasaninfrequentlyobservedvaluemeaningfulforthisdata?

17.Ananalystappliesananomalydetectionalgorithmtoadatasetandfindsasetofanomalies.Beingcurious,theanalystthenappliestheanomalydetectionalgorithmtothesetofanomalies.

a. Discussthebehaviorofeachoftheanomalydetectiontechniquesdescribedinthischapter.(Ifpossible,trythisforrealdatasetsandalgorithms.)

b. Whatdoyouthinkthebehaviorofananomalydetectionalgorithmshouldbewhenappliedtoasetofanomalousobjects?

10AvoidingFalseDiscoveries

Thepreviouschaptershavedescribedthealgorithms,concepts,andmethodologiesoffourkeyareasofdatamining:classification,associationanalysis,clusteranalysis,andanomalydetection.Athoroughunderstandingofthismaterialprovidesthefoundationrequiredtostartanalyzingdatainreal-worldsituations.However,withoutcarefulconsiderationofsomeimportantissuesinevaluatingtheperformanceofadataminingprocedure,theresultsproducedmaynotbemeaningfulorreproducible,i.e.,theresultsmaybefalsediscoveries.Thewidespreadnatureofthisproblemhasbeenreportedbyanumberofhigh-profilepublicationsinscientificfields,andislikewise,commonincommerceandgovernment.Hence,itisimportanttounderstandsomeofthecommonreasonsforunreliabledataminingresultsandhowtoavoidthesefalsediscoveries.

Whenadataminingalgorithmisappliedtoadataset,itwilldutifullyproduceclusters,patterns,predictivemodels,oralistofanomalies.However,anyavailabledatasetisonlyafinitesamplefromtheoverall

population(distribution)ofallinstances,andthereisoftensignificantvariabilityamonginstanceswithinapopulation.Thus,thepatternsandmodelsdiscoveredfromaspecificdatasetmaynotalwayscapturethetruenatureofthepopulation,i.e.,allowaccurateestimationormodelingofthepropertiesofinterest.Sometimes,thesamealgorithmwillproduceentirelydifferentorinconsistentresultswhenappliedtoanothersampleofdata,thusindicatingthatthediscoveredresultsarespurious,e.g.,notreproducible.

Toproducevalid(reliableandreproducible)results,itisimportanttoensurethatadiscoveredpatternorrelationshipinthedataisnotanoutcomeofrandomchance(arisingduetonaturalvariabilityinthedatasamples),butrather,representsasignificanteffect.Thisofteninvolvesusingstatisticalprocedures,aswillbedescribedlater.Whileensuringthesignificanceofasingleresultisdemanding,theproblembecomesmorecomplexwhenwehavemultipleresultsthatneedtobeevaluatedsimultaneously,suchasthelargenumbersofitemsetstypicallydiscoveredbyafrequentpatternminingalgorithm.Inthiscase,manyorevenmostoftheresultswillrepresentfalsediscoveries.Thisisalsodiscussedindetailinthischapter.

Thepurposeofthischapteristocoverafewselectedtopics,knowledgeofwhichisimportantforavoidingcommondataanalysisproblemsandproducingvaliddataminingresults.Someofthesetopicshavebeen

discussedinspecificcontextsearlierinthebook,particularlyintheevaluationsectionsoftheprecedingchapters.Wewillbuilduponthesediscussionstoprovideanindepthviewofsomestandardproceduresforavoidingfalsediscoveriesthatareapplicableacrossmostareasofdatamining.Manyoftheseapproachesweredevelopedbystatisticiansfordesignedexperiments,wherethegoalwastocontrolexternalfactorsasmuchaspossible.Currently,however,theseapproachesareoften(perhapsevenmostly)appliedtoobservationaldata.Akeygoalofthischapteristoshowhowthesetechniquescanbeappliedtotypicaldataminingtaskstohelpensurethattheresultingmodelsandpatternsarevalid.

10.1 Preliminaries: Statistical Testing

Before we discuss approaches for producing valid results in data mining problems, we first introduce the basic paradigm of statistical testing that is widely used for making inferences about the validity of results. A statistical test is a generic procedure for measuring the evidence that the outcome (result) of an experiment or a data analysis procedure provides for accepting or rejecting a hypothesis. For example, given the outcome of an experiment to study a new drug for a disease, we can test the evidence for the hypothesis that there is a measurable effect of the drug in treating the disease. As another example, given the outcome of a classifier on a test data set, we can test the evidence for the hypothesis that the classifier performs better than random guessing. In the following, we describe different frameworks for statistical testing.

10.1.1 Significance Testing

Suppose you want to hire a stockbroker who can make profitable decisions on your investments with a high success rate. You know of a stockbroker, Alice, who made profitable decisions for 7 of her last 10 stock picks. How confident would you be in offering her the job because of your assumption that a performance as good as Alice's is not likely due to random chance?

Questions of the above type can be answered using the basic tools of significance testing. Note that in any general problem of statistical testing, we are looking for some evidence in the outcome to validate a desired phenomenon, pattern, or relationship. For the problem of hiring a successful stockbroker, the desired phenomenon is that Alice indeed has knowledge of how the stock prices vary and uses this knowledge to make 7 correct decisions out of 10. However, there is also a possibility that Alice's performance is no better than what might be obtained by randomly guessing on all 10 decisions. The primary goal of significance testing is to check whether there is sufficient evidence in the outcome to reject the default hypothesis (also called the null hypothesis) that Alice's performance for making profitable stock decisions is no better than random.

Null Hypothesis The null hypothesis is a general statement that the desired pattern or phenomenon of interest is not true and the observed outcome can be explained by the natural variability, e.g., by random chance. The null hypothesis is assumed to be true until there is sufficient evidence to indicate otherwise. It is commonly denoted as H0. Informally, if the result obtained from the data is unlikely under the null hypothesis, this provides evidence that our result is not just a result of natural variability in the data.

Forexample,thenullhypothesisinthestockbrokerproblemcouldbethatAliceisnobetteratmakingdecisionsthanapersonwhoperformsrandomguessing.RejectingthisnullhypothesiswouldimplythatthereissufficientgroundstobelieveAlice’sperformanceisbetterthanrandomguessing.Moregenerally,weareinterestedintherejectionofthenullhypothesissincethattypicallyimpliesanoutcomethatisnotduetonaturalvariability.

Since declaring the null hypothesis is the first step in the framework of significance testing, care must be taken to state it in a precise and complete manner so that the subsequent steps produce meaningful results. This is important because misstating or loosely stating the null hypothesis can yield misleading conclusions. A general approach is to begin with a statement of the desired result, e.g., a pattern captures an actual relationship between variables, and take the null hypothesis to be the negation (opposite) of that statement, e.g., the pattern is due to natural variability in the data.

TestStatisticToperformsignificancetesting,wefirstneedawaytoquantifytheevidenceintheobservedoutcomewithrespecttothenullhypothesis.Thisisachievedbyusingateststatistic,R,whichtypicallysummarizeseverypossibleoutcomeasanumericalvalue.Morespecifically,theteststatisticenablesthecomputationoftheprobabilityofanoutcomeunderthenullhypothesis.Forexample,inthestockbrokerproblem,Rcouldbethenumberofsuccessful(profitable)decisionsmadeinthelast10decisions.Inthisway,theteststatisticreducesanoutcomeconsistingof10differentdecisionsintoasinglenumericalvalue,i.e.,thecountofsuccessfuldecisions.

The test statistic is typically a count or real-valued quantity and measures how "extreme" an observed result is under the null hypothesis. Depending on the choice of the null hypothesis and the way the test statistic is designed, there can be different ways of defining what is "extreme" relative to the null hypothesis. For example, an observed test statistic, Robs, can be considered extreme if it is greater than or equal to a certain value, RH, smaller than or equal to a certain value, RL, or outside a specified interval, [RL, RH]. The first two cases result in "one-sided tests" (right-tailed and left-tailed, respectively), while the last case results in a "two-sided test."

Null Distribution Having decided an appropriate test statistic for a problem, the next step in significance testing is to determine the distribution of the test statistic under the null hypothesis. This is known as the null distribution, which can be formally described as follows.

Definition 10.1 (null distribution). Given a test statistic, R, the distribution of R under the null hypothesis, H0, is called the null distribution, P(R|H0).

The null distribution can be determined in a number of ways. For example, we can use statistical assumptions about the behavior of R under H0 to generate exact statistical models of the null distribution. We can also conduct experiments to produce samples from H0 and then analyze these samples to approximate the null distribution. In general, the approach for determining the null distribution depends on the specific characteristics of the problem. We will discuss approaches for determining the null distribution in the context of data mining problems in Section 10.2. We illustrate with an example of the null distribution for the stockbroker problem.

Example 10.1 (Null Distribution for Stockbroker Problem). Consider the stockbroker problem where the test statistic, R, is the number of successes of a stockbroker in the last N = 100 decisions. Under the null hypothesis that the stockbroker performs no better than random guessing, the probability of making a successful decision would be p = 0.5. Assuming that the decisions on different days are independent of each other, the probability of obtaining an observed value of R, the total number of successes in N decisions, under the null hypothesis can be modeled using the binomial distribution, which is described by the following equation:

$$P(R \mid H_0) = \binom{N}{R} \times p^R \times (1-p)^{N-R}.$$

Figure 10.1 shows the plot of this null distribution as a function of R for N = 100.

Figure 10.1. Null distribution for the stockbroker problem with N = 100.

The null distribution can be used to determine how unlikely it is to obtain the observed value of the test statistic, Robs, under the null hypothesis. In particular, the null distribution can be used to compute the probability of obtaining Robs or "something more extreme" under the null hypothesis. This probability is called the p-value, which can be formally defined as follows:

Definition 10.2 (p-value). The p-value of an observed test statistic, Robs, is the probability of obtaining Robs or something more extreme from the null distribution. Depending on how "more extreme" is defined for the test statistic, R, under the null hypothesis, H0, the p-value of Robs can be written as follows:

$$\text{p-value}(R_{obs}) = \begin{cases} P(R \geq R_{obs} \mid H_0), & \text{for right-tailed tests,} \\ P(R \leq R_{obs} \mid H_0), & \text{for left-tailed tests,} \\ P(R \geq |R_{obs}| \text{ or } R \leq -|R_{obs}| \mid H_0), & \text{for two-sided tests.} \end{cases}$$

The reason that we account for "something more extreme" in the calculation of p-values is that the probability of any particular result is often 0 or close to 0. P-values thus capture the aggregate tail probabilities of the null distribution for test statistic values that are at least as extreme as Robs. For the stockbroker problem, since larger values of the test statistic (count of successful decisions) would be considered more extreme under the null hypothesis, we would compute p-values using the right tail of the null distribution.

Example 10.2 (P-values are Tail Probabilities). To illustrate the fact that p-values can be computed using the left tail, right tail, or both, consider an example where the null distribution has a Gaussian distribution with mean 0 and standard deviation 1, i.e., N(0, 1). Figure 10.2 shows the test statistic values corresponding to a p-value of 0.05 for left-tailed, right-tailed, or two-sided tests. (See shaded regions.)

Wecanseethatthep-valuescorrespondtotheareainthetailsofthenulldistribution.Whileatwo-sidedtesthas0.025probabilityineachofthetails,aone-sidedtesthasallofits0.05probabilityinasingletail.

Figure10.2.Illustrationofp-valuesasshadedregionsforleft-tailed,right-tailed,andtwo-sidedtests.
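
As an illustration of computing a right-tailed p-value from a binomial null distribution, the following short Python sketch (the helper name right_tailed_pvalue is ours; SciPy is assumed) evaluates the stockbroker examples above; the printed numbers are only indicative.

```python
# A minimal sketch of a right-tailed binomial significance test.
from scipy.stats import binom

def right_tailed_pvalue(successes, n, p_null=0.5):
    """P(R >= successes) under the null hypothesis R ~ Binomial(n, p_null)."""
    # sf(k) = P(R > k), so use successes - 1 to include the observed value itself
    return binom.sf(successes - 1, n, p_null)

# Alice from Section 10.1.1: 7 profitable picks out of 10.
print(right_tailed_pvalue(7, 10))    # ~0.172: not significant at alpha = 0.05

# A stockbroker with 60 successes in N = 100 decisions (the null distribution of Example 10.1).
print(right_tailed_pvalue(60, 100))  # ~0.028: significant at alpha = 0.05
```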

Assessing Statistical Significance P-values provide the necessary tool to assess the strength of the evidence in a result against the null hypothesis. The key idea is that if the p-value is low, then a result at least as extreme as the observed result is less likely to be obtained from H0. For example, if the p-value of a result is 0.01, then there is only a 1% chance of observing a result from the null hypothesis that is at least as extreme as the observed result.

A low p-value indicates smaller probabilities in the tails of the null distribution (for both one-sided and two-sided tests). This can provide sufficient evidence to believe that the observed result is a significant departure from the null hypothesis, thus convincing us to reject H0. Formally, we often use a threshold on p-values (called the level of significance) and describe an observed p-value that is lower than this threshold as statistically significant.

Definition10.3(StatisticallySignificantResult).Givenauser-definedlevelofsignificance,α,aresultiscalledstatisticallysignificantifithasap-valuelowerthanα.

Some common choices for the level of significance are 0.05 (5%) and 0.01 (1%). The p-value of a statistically significant result denotes the probability of falsely rejecting H0 when H0 is true. Hence, a low p-value provides higher confidence that the observed result is not likely to be consistent with H0, thus making it worthy of further investigation. This often means gathering additional data or conducting non-statistical verification, e.g., by performing experimental validation. (See Bibliographic Notes.) However, even when the p-value is low, there is always some chance (unless the p-value is 0) that H0 is true and we have merely encountered a rare event.

It is important to keep in mind that a p-value is a conditional probability, i.e., it is computed under the assumption that H0 is true. Consequently, a p-value is not the probability of H0, which may be likely or unlikely even if the test result is not significant. Thus, if a result is not significant, then it is not appropriate to say that we accept the null hypothesis. Instead, it is better to say that we fail to reject the null hypothesis. However, when the null hypothesis is known to be true most of the time, e.g., when testing for an effect or result that is rarely seen, it is common to say that we accept the null hypothesis. (See Exercise 6.)

10.1.2 Hypothesis Testing

While significance testing was developed by the famous statistician Fisher as an actionable framework for statistical inference, its intended use is limited to exploratory analyses of the null hypothesis in the preliminary stages of a study, e.g., to refine the null hypothesis or modify future experiments. One of the major limitations of significance testing is that it does not explicitly specify an alternative hypothesis, H1, which is typically the statement we would like to establish as true, i.e., that a result is not spurious. Hence, significance testing can be used to reject the null hypothesis but is unsuitable for determining whether an observed result actually supports H1.

The framework of hypothesis testing, developed by the statisticians Neyman and Pearson, provides a more objective and rigorous approach for statistical testing by explicitly defining both null and alternative hypotheses. Hence, apart from computing a p-value, i.e., the probability of falsely rejecting the null hypothesis when H0 is true, we can also compute the probability of falsely saying a result is not significant when the alternative hypothesis is true. This allows hypothesis testing to provide a more detailed assessment of the evidence provided by an observed result.

In hypothesis testing, we first define both the null and alternative hypotheses (H0 and H1, respectively) and choose a test statistic, R, that helps to differentiate the behavior of results under the null hypothesis from the behavior of results under the alternative hypothesis. As with significance testing, care must be taken to ensure that the null and alternative hypotheses are precisely and comprehensively defined. We then model the distribution of the test statistic under the null hypothesis, P(R|H0), as well as under the alternative hypothesis, P(R|H1). Similar to the null distribution, there can be many ways of generating the distribution of R under the alternative hypothesis, e.g., by making statistical assumptions about the nature of H1 or by conducting experiments and analyzing samples from H1. In the following example, we concretely illustrate a simple approach for modeling P(R|H1) in the stockbroker problem.

Example 10.3 (Alternative Hypothesis for Stockbroker Problem). In Example 10.1, we saw that under the null hypothesis of random guessing, the probability of obtaining a success on any given day can be assumed to be p = 0.5. There could be many alternative hypotheses for this problem, all of which would assume that the probability of success is greater than 0.5, i.e., p > 0.5, thus representing a situation where the stockbroker performs better than random guessing. To be specific, assume p = 0.7. The distribution of the test statistic (number of successes in N = 100 decisions) under the alternative hypothesis would then be given by the following binomial distribution:

$$P(R \mid H_1) = \binom{N}{R} \times p^R \times (1-p)^{N-R}.$$

Figure 10.3 shows the plot of this distribution (dotted line) with respect to the null distribution (solid line). We can see that the alternative distribution is shifted toward the right. Notice that if a stockbroker has more than 60 successes, then this outcome will be more likely under H1 than H0.

Figure 10.3. Null and alternative distributions for the stockbroker problem with N = 100.
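
The crossover described in Example 10.3 (outcomes above 60 successes becoming more likely under H1 than under H0) is easy to verify numerically. The following Python sketch, with variable names of our choosing and SciPy assumed, evaluates both binomial distributions.

```python
# Sketch of the null and alternative distributions of Example 10.3.
from scipy.stats import binom

N = 100
null = binom(N, 0.5)          # H0: random guessing
alt  = binom(N, 0.7)          # H1: success probability 0.7

# Compare the likelihood of observing exactly R successes under H0 and H1.
for R in (50, 60, 61, 70):
    more_likely = "H1" if alt.pmf(R) > null.pmf(R) else "H0"
    print(R, null.pmf(R), alt.pmf(R), "more likely under", more_likely)

# Tail probability of more than 60 successes under each hypothesis.
print("P(R > 60 | H0) =", null.sf(60))
print("P(R > 60 | H1) =", alt.sf(60))
```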

Critical Region Given the distributions of the test statistic under the null and alternative hypotheses, the framework of hypothesis testing decides if we should "reject" the null hypothesis or "not reject" the null hypothesis given the evidence provided by the test statistic computed from an observed result. This binary decision is typically made by specifying a range of possible values of the test statistic that are too extreme under H0. This set of values is called the critical region. If the observed test statistic, Robs, falls in this region, then the null hypothesis is rejected. Otherwise, the null hypothesis is not rejected.

The critical region corresponds to the collection of extreme results whose probability of occurrence under the null hypothesis is less than a threshold. The critical region can either be in the left tail, right tail, or both left and right tails of the null distribution, depending on the type of statistical testing being used. The probability of the critical region under H0 is called the significance level, α. In other words, it is the probability of falsely rejecting the null hypothesis for results belonging to the critical region when H0 is true. In most applications, a low value of α (e.g., 0.05 or 0.01) is specified by the user to define the critical region.

Rejectingthenullhypothesisiftheteststatisticfallsinthecriticalregionisequivalenttoevaluatingthep-valueoftheteststatisticandrejectingthenullhypothesisifthep-valuefallsbelowapre-specifiedthreshold,α.Notethatwhileeveryresulthasadifferentp-value,thesignificancelevel,α,inhypothesistestingisafixedconstantwhosevalueisdecidedbeforeanytestsareperformed.

TypeIandTypeIIErrorsUptothispoint,hypothesistestingmayseemsimilartosignificancetesting,atleastsuperficially.However,byconsideringboththenullandalternativehypotheses,hypothesistestingallowsustolookattwodifferenttypesoferrors,typeIerrorandtypeIIerrors,asdefinedbelow.

Definition 10.4 (Type I Error). A type I error is the error of incorrectly rejecting the null hypothesis for a result. The probability of incurring a type I error is called the type I error rate, α. It is equal to the probability of the critical region under H0, i.e., α is the same as the significance level. Formally,

$$\alpha = P(R \in \text{Critical Region} \mid H_0).$$

Definition 10.5 (Type II Error). A type II error is the error of falsely calling a result to be not significant when the alternative hypothesis is true. The probability of incurring a type II error is called the type II error rate, β, which is equal to the probability of observing test statistic values outside the critical region under H1, i.e.,

$$\beta = P(R \notin \text{Critical Region} \mid H_1).$$

Note that deciding the critical region (specifying α) automatically determines the value of β for a particular test, given the distribution of the test statistic under the alternative hypothesis.

A closely related concept to the type II error rate is the power of the test, which is the probability of the critical region under H1, i.e., 1 − β. Power is an important characteristic of a test because it indicates how effective a test will be at correctly rejecting the null hypothesis. Low power means that many results that actually show the desired pattern or phenomenon will not be considered significant and thus will be missed. As a consequence, if the power of a test is low, then it may not be appropriate to ignore results that fall outside the critical region. Increasing the size of the critical region to increase power and decrease type II errors will increase type I errors, and vice-versa. Hence, it is the balance between ensuring a low value of α and a high value of power that is at the core of hypothesis testing.

Whenthedistributionoftheteststatisticunderthenullandalternativehypothesesdependsonthenumberofsamplesusedtoestimatetheteststatistic,thenincreasingthenumberofsampleshelpsobtainlessvariableestimatesofthetruenullandalternativedistributions.ThisreducesthechancesoftypeIandtypeIIerrors.Forexample,evaluatingastockbrokeron100decisionsismorelikelytogiveusanaccurateestimateoftheirtruesuccessratethanevaluatingthestockbrokeron10decisions.Theminimumnumberofsamplesrequiredforensuringalowvalueofαwhilehavingahighvalueofpowerisoftendeterminedbyastatisticalprocedurecalledpoweranalysis.(SeeBibliographicNotesformoredetails.)

Example 10.4 (Classifying Medical Results). Suppose the value of a blood test is used as the test statistic, R, to identify whether a patient has a certain disease or not. It is known that the value of this test statistic has a Gaussian distribution with mean 40 and standard deviation 5 for patients that do not have the disease. For patients having the disease, the test statistic has a mean of 60 and a standard deviation of 5. These distributions are shown in Figure 10.4. H0 is the null hypothesis that the patient doesn't have the disease, i.e., R comes from the leftmost distribution shown in the subfigures of Figure 10.4. H1 is the alternative hypothesis that the patient has the disease, i.e., R comes from the rightmost distribution in the subfigures of Figure 10.4.

Figure10.4.Distributionofteststatisticforthealternativehypothesis(rightmostdensitycurve)andnullhypothesis(leftmostdensitycurve).Shadedregioninrightsubfigureisα.

Supposethecriticalregionischosentobe50andabove,sincealevelof50isexactlyhalfwaybetweenmeansofthetwodistributions.Thesignificancelevel,α,correspondingtothischoiceofcriticalregion,shownastheshadedregionintherightsubfigureofFigure10.4 ,canthenbecalculatedasfollows:

$$\alpha = P(R \geq 50 \mid H_0) = P\big(R \geq 50 \mid R \sim N(\mu = 40, \sigma = 5)\big) = \int_{50}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(R-\mu)^2}{2\sigma^2}}\, dR = \int_{50}^{\infty} \frac{1}{\sqrt{50\pi}}\, e^{-\frac{(R-40)^2}{50}}\, dR = 0.023$$

The type II error rate, β, for this choice of critical region can also be found to be equal to 0.023. (This is only because the null and alternative hypotheses have the same distribution, except for their means, and the critical value of 50 is halfway between their means.) This is shown as the shaded region in the left subfigure of Figure 10.5. The power is then equal to 1 − 0.023 = 0.977, which is shown in the right subfigure of Figure 10.5.

Figure10.5.Shadedregioninleftsubfigureisβandshadedregioninrightsubfigureispower.

Ifweuseαof0.05insteadof0.023,thecriticalregionwouldbeslightlyexpandedto48.22andabove.Thiswouldincreasethepowerfrom0.977to0.991,thoughatthecostofahighervalueofα.Ontheotherhand,decreasingαto0.01woulddecreasethepowerto0.952.
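
The α, β, and power values quoted in Example 10.4 can be reproduced with a few lines of Python. The sketch below uses a helper name of our own (errors_for_cutoff) and assumes SciPy's norm distribution.

```python
# Sketch of the alpha/beta/power calculation in Example 10.4.
from scipy.stats import norm

h0 = norm(loc=40, scale=5)   # test statistic distribution for patients without the disease
h1 = norm(loc=60, scale=5)   # test statistic distribution for patients with the disease

def errors_for_cutoff(c):
    alpha = h0.sf(c)          # type I error rate: P(R >= c | H0)
    beta  = h1.cdf(c)         # type II error rate: P(R <  c | H1)
    return alpha, beta, 1 - beta

for cutoff in (50.0, 48.22):
    a, b, power = errors_for_cutoff(cutoff)
    print(f"cutoff={cutoff}: alpha={a:.3f}, beta={b:.3f}, power={power:.3f}")
# cutoff 50 gives alpha = beta ~ 0.023 and power ~ 0.977;
# widening the critical region to ~48.22 raises alpha to ~0.05 and power to ~0.991.
```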

EffectSizeEffectsizebringsindomainconsiderationsbyconsideringwhethertheobservedresultissignificantfromadomainpointofview.Forexample,supposethatanewdrugisfoundtolowerbloodpressure,butonlyby1%.Thisdifferencewillbestatisticallysignificantwithalargeenoughtestgroup,butthemedicalsignificanceofaneffectsizeof1%isprobablynotworththecostofthemedicineandthepotentialforsideeffects.Thus,aconsiderationofeffectsizeiscriticalsinceitcanoftenhappenthataresultisstatisticallysignificant,butofnopracticalimportanceinthedomain.Thisisparticularlytrueforlargedatasets.

Definition10.6(effectsize).Theeffectsizemeasuresthemagnitudeoftheeffectorcharacteristicbeingevaluated,andistypicallythemagnitudeoftheteststatistic.

In most problems there is a desired effect size that helps determine the null and alternative hypotheses. For the stockbroker problem, as illustrated in Example 10.3, the desired effect size is the desired probability of success, 0.7. For the medical testing problem, which was just discussed in Example 10.4, the effect size is the value of the threshold used to define the cutoff between normal patients and those with the disease. When comparing the means of two sets of observations (A and B), the effect size is the difference in the means, i.e., μA − μB, or the absolute difference, |μA − μB|.

The desired effect size impacts the choice of the critical region, and thus the significance level and power of the test. Exercises 4 and 5 further explore some of these concepts.

10.1.3 Multiple Hypothesis Testing

The statistical testing frameworks discussed so far are designed to measure the evidence in a single result, i.e., whether the result belongs to the null hypothesis or the alternative hypothesis. However, many situations produce multiple results that need to be evaluated. For example, frequent pattern mining typically produces many frequent itemsets from a given transaction data set, and we need to test every frequent itemset to determine whether there is a statistically significant association among its constituent items. The multiple hypothesis testing problem (also called the multiple testing problem or multiple comparison problem) addresses the statistical testing problem where multiple results are involved and a statistical test is performed on every result.

Thesimplestapproachistocomputethep-valueunderthenullhypothesisforeachresultindependentlyofotherresults.Ifthep-valueissignificantforanyresult,thenthenullhypothesisisrejectedforthatresult.However,thisstrategywilltypicallyproducemanyerroneousresultswhenthenumberofresultstobetestedislarge.Forexample,evenifsomethingonlyhasa5%chanceofhappeningforasingleresult,thenitwillhappen,onaverage,5timesoutofa100.Thus,ourapproachforhypothesistestingneedstobemodified.

Whenworkingwithmultipletests,weareinterestedinreportingthetotalnumberoferrorscommittedonacollectionofresults(alsoreferredtoasafamilyofresults).Forexample,ifwehaveacollectionofmresults,wecancountthetotalnumberoftimesatypeIerroriscommittedoratypeIIerroriscommittedoutofthemtests.TheaggregateinformationoftheperformanceacrossalltestscanbesummarizedbytheconfusionmatrixshowninTable10.1 .Inthistable,aresultthatactuallybelongstothenullhypothesisiscalleda‘negative’whilearesultthatactuallybelongstothealternativehypothesisiscalleda‘positive.’ThistableisessentiallythesameasTable4.6 ,whichwasintroducedinthecontextofevaluatingclassificationperformanceinSection4.11.2 .

Table 10.1. Confusion table in the context of multiple hypothesis testing.

|                     | Declared significant (+ prediction) | Declared not significant (− prediction) | Total          |
| True H1 (actual +)  | True Positive (TP)                  | False Negative (FN), type II error      | Positives (m1) |
| True H0 (actual −)  | False Positive (FP), type I error   | True Negative (TN)                      | Negatives (m0) |
| Total               | Positive Predictions (Ppred)        | Negative Predictions (Npred)            | m              |

In most practical situations where we are performing multiple hypothesis testing, e.g., while using statistical tests to evaluate whether a collection of patterns, clusters, etc. are spurious, the required entries in Table 10.1 are seldom available. (For classification, the table is available when reliable labels are available, in which case many of the quantities of interest can be directly estimated. See Section 10.3.2.) When entries are not available, we need to estimate them, or more typically, quantities derived from these entries. The following paragraphs of this section describe various approaches for doing this.

Family-wise Error Rate (FWER) A useful error metric when dealing with a family of results is the family-wise error rate (FWER), which is the probability of observing even a single false positive (type I error) in the entire set of m results. In particular,

$$FWER = P(FP > 0).$$

If the FWER is lower than a certain threshold, say α, then the probability of observing any type I error among all the results is less than α.

TheFWERthusmeasurestheprobabilityofobservingatypeIerrorinanyorallofthemtests.ControllingtheFWER,i.e.,ensuringtheFWERtobelow,isusefulinapplicationswhereasetofresultsisdiscardedifevenasingletestiserroneous(producesatypeIerror).Forexample,considertheproblemofselectingastockbrokerdescribedinExample10.3 .Inthiscase,thegoalistofind,fromapoolofapplicants,astockbrokerthatmakescorrectdecisionsatleast70%ofthetime.EvenasingletypeIerrorcanleadtoanerroneoushiringdecision.Insuchcases,estimatingtheFWERgivesusabetterpictureoftheperformanceoftheentiresetofresultsthanthenaїveapproachofcomputingp-valuesseparatelyforeachresult.Thefollowingexampleillustratesthisconceptinthecontextofthestockbrokerproblem.

Example 10.5 (Testing Multiple Stockbrokers). Consider the problem of selecting successful stockbrokers from a pool of m = 50 candidates. For every stockbroker, we perform a statistical test to check whether their performance (number of successful decisions in the last N decisions) is better than random guessing. If we use a significance level of α = 0.05 for every such test, the probability of making a type I error on any individual candidate is equal to 0.05. However, if we assume that the results are independent, the probability of observing even a single type I error in any of the 50 tests, i.e., the FWER, is given by

$$FWER = 1 - (1 - \alpha)^m = 1 - (1 - 0.05)^{50} = 0.923, \qquad (10.1)$$

which is extremely high. Even though the probability of observing no false positives on a single test is quite high (1 − α = 0.95), the probability of seeing no false positives across all tests (0.95^50 = 0.077) diminishes by repeated multiplication. Hence, the FWER can be quite high when m is large, even if the type I error rate, α, is low.
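
Equation 10.1 is simple enough to check directly. The minimal Python sketch below (the function name fwer is ours) evaluates the FWER for independent tests and previews the effect of the Bonferroni correction discussed next.

```python
# Sketch of the family-wise error rate calculation in Equation 10.1.
def fwer(alpha, m):
    """Probability of at least one false positive in m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

print(fwer(0.05, 1))        # 0.05  : a single test
print(fwer(0.05, 50))       # ~0.923: Example 10.5, fifty candidate stockbrokers
print(fwer(0.05 / 50, 50))  # ~0.049: Bonferroni-corrected level keeps the FWER below alpha
```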

Bonferroni Procedure A number of procedures have been developed to ensure that the FWER of a set of results is lower than an acceptable threshold, α, which is often 0.05. These approaches, called FWER controlling procedures, basically try to adjust the p-value threshold which is used for every test, so that there is only a small chance of erroneously rejecting the null hypothesis in the presence of multiple tests. To illustrate this category of procedures, we describe the most conservative approach, which is the Bonferroni procedure.

Definition 10.7 (Bonferroni procedure). If m results are to be tested so that the FWER is less than α, the Bonferroni procedure sets the significance level for every test to be α* = α/m.

The intuition behind the Bonferroni procedure can be understood by observing the formula for FWER in Equation 10.1, where the m tests are assumed to be independent of each other. Using a reduced significance level of α/m in Equation 10.1 and applying the binomial theorem, we can see that the FWER is controlled below α as follows:

$$FWER = 1 - \left(1 - \frac{\alpha}{m}\right)^m = 1 - \left(1 + m\left(-\frac{\alpha}{m}\right) + \binom{m}{2}\left(-\frac{\alpha}{m}\right)^2 + \cdots + \left(-\frac{\alpha}{m}\right)^m\right) = \alpha - \binom{m}{2}\left(-\frac{\alpha}{m}\right)^2 - \binom{m}{3}\left(-\frac{\alpha}{m}\right)^3 - \cdots - \left(-\frac{\alpha}{m}\right)^m \leq \alpha.$$

While the above discussion was for the case where the tests are assumed to be independent, the Bonferroni approach guarantees no type I error in the m tests with a probability of 1 − α, irrespective of whether the tests (results) are correlated or independent. We illustrate the importance of the Bonferroni procedure for controlling FWER using the following example.

Example 10.6 (Bonferroni Procedure). In the multiple stockbroker problem described in Example 10.5, we analyze the effect of the Bonferroni procedure in controlling the FWER. The null distribution for an individual stockbroker can be modeled using the binomial distribution where p = 0.5 and N = 100. Given a set of m results simulated from the null distribution (assuming the results are independent), we compare the performance of two competing approaches: the naïve approach, which uses a significance level of α = 0.05, and the Bonferroni procedure, which uses a significance level of α* = 0.05/m.

Figure 10.6 shows the FWER of these two approaches as we vary the number of results, m. (We used 10^6 simulations.) We can see that the FWER of the Bonferroni procedure is always controlled to be below α, while the FWER of the naïve approach shoots up rapidly and reaches a value close to 1 for m greater than 70. Hence, the Bonferroni procedure is preferred over the naïve approach when m is large and the FWER is the error metric we wish to control.

Figure 10.6. The family-wise error rate (FWER) curves for the naïve approach and the Bonferroni procedure as a function of the number of results, m.
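
Figure 10.6 comes from a large simulation; a scaled-down Monte Carlo sketch along the same lines is shown below. All names are ours, NumPy and SciPy are assumed, and far fewer runs than the 10^6 used in the example are performed, so the estimates are only approximate; in addition, the discreteness of the binomial makes the naïve FWER somewhat lower than Equation 10.1 would suggest.

```python
# A small Monte Carlo sketch in the spirit of Example 10.6.
import numpy as np
from scipy.stats import binom

def simulated_fwer(m, alpha=0.05, N=100, p=0.5, runs=20000, bonferroni=False, seed=0):
    """Fraction of simulation runs with at least one false rejection among m null stockbrokers."""
    rng = np.random.default_rng(seed)
    level = alpha / m if bonferroni else alpha
    # right-tailed p-value for each simulated success count under H0
    successes = rng.binomial(N, p, size=(runs, m))
    pvals = binom.sf(successes - 1, N, p)
    return np.mean((pvals <= level).any(axis=1))

for m in (10, 50, 100):
    print(m, "naive:", simulated_fwer(m), "Bonferroni:", simulated_fwer(m, bonferroni=True))
```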

The Bonferroni procedure is almost always overly conservative, i.e., it will eliminate some non-spurious results, especially when the number of results is large and the results may be correlated to each other, e.g., in frequent pattern mining. In the extreme case where all m results are perfectly correlated to each other (and hence identical), the Bonferroni procedure would still use a significance level of α/m even though a significance level of α would have sufficed. To address this limitation, a number of alternative FWER controlling procedures have been developed that are less conservative than Bonferroni when dealing with correlated results. (See Bibliographic Notes for more details.)

False Discovery Rate (FDR)

Bydefinition,allFWERcontrollingproceduresseekalowprobabilityforobtaininganyfalsepositives,andthus,arenottheappropriatetoolwhenthegoalistoallowsomefalsepositivesinordertogetmoretruepositives.Forexample,infrequentpatternmining,weareinterestedinselectingfrequentitemsetsthatshowstatisticallysignificantassociations(actualpositives),whilediscardingtheremainingones.Asanotherexample,whentestingforaseriousdisease,itisbettertogetmoretruepositives(detectmoreactualcasesofthedisease)evenifthatmeansgeneratingsomefalsepositives.Inbothcases,wearereadytotolerateafewfalsepositivesaslongasweareabletoachievereasonablepowerforthedetectionoftruepositives.

Thefalsediscoveryrate(FDR)providesanerrormetrictomeasuretherateoffalsepositives,whicharealsocalledfalsediscoveries.TocomputeFDR,wefirstdefineavariable,Q,thatisequaltothenumberoffalsepositives,FP,dividedbythetotalnumberofresultspredictedaspositive,Ppred.(SeeTable10.1 .)

$$Q = \frac{FP}{P_{pred}} = \frac{FP}{TP + FP}, \ \text{if } P_{pred} > 0; \qquad Q = 0, \ \text{if } P_{pred} = 0.$$

When we know FP, the number of false positives, as in classification, Q is essentially the false discovery rate, as defined in Section 4.11.2, which introduced measures for evaluating classification performance under class imbalance. As such, Q is closely related to the precision. Specifically, precision = 1 − FDR = 1 − Q. However, in the context of statistical testing, when Ppred = 0, i.e., when no results are predicted as positive, Q = 0 by definition. However, in data mining, precision and thus FDR, as defined in Section 4.11.2, are typically considered to be undefined in that situation.

In the cases where we do not know FP, it is not possible to use Q as the false discovery rate. Nonetheless, it is still possible to estimate the value of Q, on average, i.e., to compute the expected value of Q and use that as our false discovery rate. Formally,

$$FDR = E(Q). \qquad (10.2)$$

The FDR is a useful metric for ensuring that the rate of false positives is low, especially in cases where the positives are highly skewed, i.e., the number of actual positives in the collection of results, m1, is very small compared to the number of actual negatives, m0.

Benjamini-Hochberg Procedure Statistical testing procedures that try to control the FDR are known as FDR controlling procedures. These procedures can typically ensure a low number of false positives (even when the positive class is relatively infrequent) while providing higher power than the more conservative FWER controlling procedures. A widely-used FDR controlling procedure is the Benjamini-Hochberg (BH) procedure, which sorts the results in increasing order of their p-values and uses a different significance level, α(i), for every result, Ri.

The basic idea behind the BH procedure is that if we have observed a large number of significant results that have a lower p-value than a given result, Ri, we can be less stringent while testing Ri and use a more relaxed significance level than α/m. Algorithm 10.1 provides a summary of the BH procedure. The first step in this algorithm is to compute the p-values for every result and sort the results in increasing order of their p-values (steps 1 to 2). Thus, pi would correspond to the ith smallest p-value. The significance level, αi, for pi is then computed using the following correction (line 3):

$$\alpha_i = \frac{i \times \alpha}{m}.$$

Notice that the significance level for the smallest p-value, p1, is equal to α/m, which is the same as the correction used in the Bonferroni procedure. Further, the significance level for the largest p-value, pm, is equal to α, which is the significance level for a single test (without accounting for multiple hypothesis testing). In between these two p-values, the significance level grows linearly from α/m to α. Hence, the BH procedure can be viewed as striking a balance between the overly conservative Bonferroni procedure and the overly liberal naïve approach, thus resulting in higher power (finding more actual positives) without producing too many false positives. Let k be the largest index such that pk is lower than its significance level, αk (line 4). The BH procedure then declares the first k p-values as significant (lines 4 to 5). It can be shown that the FDR computed using the BH procedure is guaranteed to be smaller than α. In particular,

$$FDR \leq \frac{m_0}{m}\,\alpha \leq \alpha, \qquad (10.3)$$

where m0 is the number of actual negative results and m is the total number of results. (See Table 10.1.)

Algorithm 10.1 Benjamini-Hochberg (BH) FDR algorithm.

1: Compute p-values for the m results.
2: Order the p-values from smallest to largest (p1 to pm).
3: Compute the significance level for pi as αi = i × α/m.
4: Let k be the largest index such that pk ≤ αk.
5: Reject H0 for all results corresponding to the first k p-values, pi, 1 ≤ i ≤ k.
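
A compact Python sketch of Algorithm 10.1 is given below (the function name benjamini_hochberg and the example p-values are ours; NumPy is assumed). It returns, for each input p-value, whether the corresponding result is declared significant.

```python
# A minimal sketch of Algorithm 10.1, the Benjamini-Hochberg (BH) procedure.
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean array marking which results are declared significant."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)                            # step 2: sort p-values, smallest to largest
    thresholds = (np.arange(1, m + 1) * alpha) / m   # step 3: alpha_i = i * alpha / m
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])             # step 4: largest index with p_k <= alpha_k
        significant[order[: k + 1]] = True           # step 5: reject H0 for the first k p-values
    return significant

# Usage example with made-up p-values.
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(pvals, alpha=0.05))
```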

Example 10.7 (BH and Bonferroni procedures). Consider the multiple stockbroker problem discussed in Example 10.6, where instead of assuming all m stockbrokers to belong to the null distribution, we may have a small number of m1 candidates who belong to an alternative distribution. The null distribution can be modeled by the binomial distribution with 0.5 probability of making a successful decision. The alternative distribution can be modeled by the binomial distribution with 0.55 probability of success, which is a slightly higher probability of success than that of random guessing. We consider N = 100 decisions for both the null and alternative distributions.

We are interested in comparing the performance of the Bonferroni and BH procedures in detecting a large fraction of actual positives (stockbrokers that indeed perform better than random guessing) without incurring a lot of false positives. We ran 10^6 simulations of m stockbrokers where m1 stockbrokers in each simulation belong to the alternative distribution while the rest belong to the null distribution. We chose m1 = m/3 to demonstrate the effect of a skewed positive class, which is quite common in most practical applications of multiple hypothesis testing. Figure 10.7 shows the plots of FDR and expected power as we vary the number of stockbrokers in each simulation run, m, for three competing procedures: the naïve approach, the Bonferroni procedure, and the BH procedure. The threshold, α, in each of the three procedures was chosen to be 0.05.

Figure 10.7. Comparing the performance of multiple comparison procedures as we vary the number of results, m, and set m1 = m/3 results as positive. α = 0.05 for each of the three procedures.

We can see that the FDR of both the Bonferroni and the BH procedures is smaller than 0.05 for all values of m, but the FDR of the naïve approach is not controlled and is close to 0.1. This shows that the naïve approach is quite relaxed in calling a result positive, and thus incurs more false positives. However, it also generally shows a high power, as many of the actual positives are indeed labeled as positive. On the other hand, the FDR of the Bonferroni procedure is much smaller than 0.05 and is the lowest among all three approaches. This is because the Bonferroni approach aims at controlling a more stringent error metric, i.e., the FWER, to be smaller than 0.05. However, it also has low power, as it is conservative in calling an actual positive significant since its goal is to avoid any false positives.

The BH procedure strikes a balance between being conservative and relaxed such that its FDR is always smaller than 0.05 but its expected power is quite high and comparable to the naïve approach. Hence, at the cost of a minor increase in the FDR over the Bonferroni procedure, it is able to achieve much higher power and thus obtain a better trade-off between minimizing type I errors and type II errors in multiple hypothesis testing problems. However, we emphasize that FWER controlling procedures, such as Bonferroni, and FDR controlling procedures, such as BH, are intended for two different tasks, and thus, the best procedure to use in any particular situation will vary depending on the goal of the analysis.

Equation 10.3 states that the FDR of the BH procedure is less than or equal to (m0/m) × α, which is equal to α only when m0 = m, i.e., when there are no actual positives in the results. Hence, the BH procedure generally discovers a smaller number of true positives, i.e., has lower power, than it should given a desired FDR of α. To address this limitation of the BH procedure, a number of statistical testing procedures have been developed to provide tighter control over FDR, such as the positive FDR controlling procedures and the local FDR controlling procedures. These techniques generally show better power than the BH procedure while ensuring a small number of false positives. (See Bibliographic Notes for more details.)

Note that some users of FDR controlling procedures assume that α should be chosen in the same way as for hypothesis (significance) testing or for FWER controlling procedures, which traditionally use α = 0.05 or α = 0.01. However, for FDR controlling procedures, α is the desired false discovery rate and is often chosen to have a value greater than 0.05, e.g., 0.20. The reason for this is that in many cases the person evaluating the results is willing to accept more false positives in order to get more true positives. This is especially true when few, if any, potential positive results are produced when α is set to a low value, such as 0.05 or 0.01. In the previous example, we chose α to be the same for all three techniques to keep the discussion simple.

10.1.4 Pitfalls in Statistical Testing

The statistical testing approaches presented above provide an effective framework for measuring the evidence in results. However, as with other data analysis techniques, using them incorrectly can often produce misleading conclusions. Much of the misunderstanding is centered on the use of p-values. In particular, p-values are commonly assigned additional meaning beyond what can be supported by the data and these procedures. In the following, we discuss some of the common pitfalls in statistical testing that should be avoided to produce valid results. Some of these describe p-values and their proper role while others identify common misinterpretations and misuses. (See Bibliographic Notes for more details.)

1. A p-value is not the probability that the null hypothesis is true. As described previously in Definition 10.2, the p-value is the conditional probability of observing a particular value of a test statistic, R, or something more extreme under the null hypothesis. Hence, we are assuming the null hypothesis is true in order to compute the p-value. A p-value does measure how compatible the observed result is with the null hypothesis.

2. Typically, there can be many hypotheses that explain a result that is found to be significant or non-significant under the null hypothesis. Note that a result that is declared non-significant, i.e., has a high p-value, was not necessarily generated from the null distribution. For example, if we model the null distribution using a Gaussian distribution with mean 0 and standard deviation 1, we will find an observed test statistic, Robs = 1.5, to be non-significant at a 5% level. However, the result could be from the alternative distribution, even if there is a low (but nonzero) probability of that event. Further, if we misspecified our null hypothesis, then the same observation could have easily come from another distribution, e.g., a Gaussian distribution with mean 1 and standard deviation 1, under which it is more likely. Hence, declaring a result to be non-significant does not amount to “accepting” the null hypothesis. Likewise, a significant result may be explained by many alternative hypotheses. Hence, rejecting the null hypothesis does not necessarily imply that we have accepted the alternative hypothesis. This is one of the reasons that p-values, or more generally the results of statistical testing, are not usually sufficient for making decisions. Factors external to the statistics, such as domain knowledge, must be applied as well.

3. A low p-value does not imply a useful effect size (magnitude of the test statistic) and vice versa. Recall that the effect size is the size of the test statistic for which the result is considered important in the domain of interest. Thus, effect size brings in domain considerations by considering whether the observed result is significant from a domain point of view. For example, suppose that a new drug is found to lower blood pressure, but only by 1%. This difference may be statistically significant, but the medical significance of an effect size of 1% is probably not worth the cost of the medicine and the potential for side effects. In particular, a significant p-value may not have a large effect size and a non-significant p-value does not imply no effect size. Since p-values depend very strongly on the size of the data set, small p-values for big data applications are becoming increasingly common since even small effect sizes will show up as being statistically significant. Thus, it becomes critical to take effect size into consideration to avoid generating results that are statistically significant but not useful. In particular, even if a result is declared significant, we should ensure that its effect size is greater than a domain-specified threshold to be of practical importance.

Example 10.8 (Significant p-values in random data). To illustrate the point that we can obtain significantly low p-values even with small effect sizes, we consider the pairwise correlation of 10 random vectors that were generated from a Gaussian distribution of mean 0 and standard deviation 1. The null hypothesis is that the correlation of any two vectors is 0. Figures 10.8a and 10.8b show that as the length of the vectors, n, increases, the maximum absolute pairwise correlation of any pair of vectors tends to 0, but the average number of pairwise correlations that have a p-value less than 0.05 remains constant at about 2.25. This shows that the number of significant pairwise correlations does not decrease when n is large, although the effect size (maximum absolute correlation) is quite low.

This example also illustrates the importance of adjusting for multiple tests. There are 45 pairwise correlations and thus, on average, we would expect 0.05 × 45 = 2.25 significant correlations at the 5% level, as is shown in Figure 10.8b.

Figure 10.8. Visualizing the effect of changing the vector length, n, on the correlations among 10 random vectors.
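A minimal sketch of the experiment in Example 10.8, assuming details not stated in the text (the vector lengths tried and the number of repetitions): it generates 10 standard normal vectors, tests all 45 pairwise correlations, and counts how many are "significant" at the 5% level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_vectors, n_trials = 10, 100            # assumed number of repetitions

for n in [100, 1000, 10000]:             # assumed vector lengths
    n_sig, max_corr = [], []
    for _ in range(n_trials):
        X = rng.standard_normal((n_vectors, n))
        sig, corrs = 0, []
        for i in range(n_vectors):
            for j in range(i + 1, n_vectors):      # 45 pairs in total
                r, p = stats.pearsonr(X[i], X[j])
                corrs.append(abs(r))
                sig += p < 0.05
        n_sig.append(sig)
        max_corr.append(max(corrs))
    print(f"n={n:6d}  avg #significant={np.mean(n_sig):.2f}  "
          f"avg max |corr|={np.mean(max_corr):.3f}")
```

The average number of significant pairs stays near 0.05 × 45 = 2.25 while the maximum absolute correlation shrinks as n grows, mirroring the behavior shown in Figures 10.8a and 10.8b.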

4. It is unsound to analyze a result in multiple ways until we are able to declare it statistically significant. This creates a multiple hypothesis testing problem and the p-values of individual results are no longer a good guide as to whether the null hypothesis should be rejected or not. (Such approaches are known as p-value hacking.) This can include cases where p-values are not explicitly used but the data is pre-processed or adjusted until a model is found that is acceptable to the investigator.

10.2 Modeling Null and Alternative Distributions

A primary requirement for conducting statistical testing is to know how the test statistic is distributed under the null hypothesis (and sometimes under the alternative hypothesis). In conventional problems of statistical testing, this consideration is kept in mind while designing the experimental setup for collecting data, so that we have enough data samples pertaining to the null and alternative hypotheses. For example, in order to test the effect of a new drug in curing a disease, experimental data is usually collected from two groups of subjects that are as similar as possible in all respects, except that one group is administered the drug while the control group is not. The data samples from the two groups then provide information to model the alternative and null distributions, respectively. There is an extensive body of work on experimental design that provides guidelines for conducting experiments and collecting data pertaining to the null and alternative hypotheses, so that they can be used later for statistical testing. However, such guidelines cannot be directly applied when dealing with observational data where the data is collected without a prior hypothesis in mind, as is common in many data mining problems.

Hence,acentralobjectivewhenusingstatisticaltestingwithobservationaldataistocomeupwithanapproachtomodelthedistributionoftheteststatisticunderthenullandalternativehypotheses.Insomecases,thiscanbedonebymakingsomestatisticalassumptionsabouttheobservations,e.g.,thatthedatafollowsaknownstatisticaldistributionsuchasthenormal,binomial,orhypergeometricdistributions.Forexample,theinstancesinadatasetmaybegeneratedbyasinglenormaldistributionwhosemeanand

variancecanbeestimatedfromthedataset.Notethatinalmostallcaseswhereastatisticalmodelisused,theparametersofthatmodelmustbeestimatedfromthedata.Hence,anyprobabilitiescalculatedusingastatisticalmodelmayhavesomeinherenterror,withthemagnitudeofthaterrordependentonhowwellthechosendistributionfitsthedataandhowwelltheparametersofthemodelcanbeestimated.

Insomescenarios,itisdifficult,orevenimpossible,toadequatelymodelthebehaviorofthedatawithaknownstatisticaldistribution.Analternativemethodistofirstgeneratesampledatasetsunderthenulloralternativehypotheses,andthenmodelthedistributionoftheteststatisticusingthenewdatasets.Forthealternativehypothesis,thenewdatasetsmustbesimilartothecurrentdataset,butshouldreflectthenaturalvariabilityinherentinthedata.Forthenullhypothesis,thesedatasetsshouldbeassimilaraspossibletotheoriginaldataset,butlackthestructureorpatternofinterest,e.g.,aconnectionbetweenattributesandvalues,clusterstructure,orassociationsbetweenattributes.

Inthefollowing,wedescribesomegenericproceduresforestimatingthenulldistributioninthecontextofstatisticaltestingfordataminingproblems.(Unfortunately,outsideofusingaknownstatisticaldistribution,therearenotmanywidelyusedmethodsforgeneratingthealternativedistribution.)TheseprocedureswillserveasthebuildingblocksofthespecificapproachesforstatisticaltestingdiscussedinSections10.3 to10.6 .Notethattheexactdetailsoftheapproachusedforestimatingthenulldistributiondependsonthespecifictypeofproblembeingstudiedandthenatureofhypothesesbeingconsidered.However,atahighlevel,approachesinvolvegeneratingcompletelynewsyntheticdatasetsorrandomizinglabels.Inaddition,wewilldiscussapproachesforresamplingexistinginstances,whichcanbeusefulforgeneratingconfidenceintervalsforvariousdataminingresults,suchastheaccuracyofapredictivemodel.

10.2.1 Generating Synthetic Data Sets

Foranalysesinvolvingunlabeleddatasuchasclusteringandfrequentpatternmining,themainapproachforestimatinganulldistributionistogeneratesyntheticdatasets,eitherbyrandomizingtheorderofattributevaluesorbygeneratingnewinstances.Theresultantdatasetsshouldbesimilartotheoriginaldatasetinallmannersexceptthattheylackapatternofinterest,e.g.,clusterstructureorfrequentpatterns,whosesignificancehastobeassessed.

Forexample,ifweneedtoassesswhethertheitemsinatransactiondatasetarerelatedtoeachotherbeyondwhateverassociationoccursbyrandomchance,wecangeneratesynthetictransactiondatasetsbyrandomizing(permuting)theorderofexistingentrieswithinrowsandcolumnsofthebinaryrepresentationofatransactiondataset.Thegoalisthattheresultingdatasetshouldhavepropertiessimilartotheoriginaldatasetintermsofthenumberoftimesanitemoccursinthetransactions(i.e.,thesupportofanitem)andthenumberofitemsineverytransaction(i.e.,thelengthofatransaction),buthavestatisticallyindependentitems.Thesesyntheticdatasetscanbeprocessedtofindassociationpatternsandtheseresultscanbeusedtoprovideanestimateofthedistributionoftheteststatisticofinterest,e.g.,thesupportorconfidenceofanitemset,underthenullhypothesis.SeeSection10.4 formoredetails

Ifweneedtoevaluatewhethertheclusterstructureinadatasetisbetterthanwemightexpectbyrandomchance,weneedtogeneratenewinstancesthat,whencombinedinadataset,lackclusterstructure.Thesyntheticdatasetscanbeclusteredandusedtoestimatethenulldistributionofteststatistic.Forclusteranalysis,thequantityofinterest,i.e.,theteststatistic,istypicallysomemeasureofclusteringgoodness,suchasSSEorthesilhouettecoefficient.SeeSection10.5 .

Althoughtheprocessofrandomizingattributesmayappearsimple,executingthisapproachcanbeverychallengingsinceanäiveattemptatgeneratingsyntheticdatasetsmayomitimportantcharacteristicsorstructureoftheoriginaldataandthusmayyieldaninadequateapproximationofthetruenulldistribution.Forexample,givenatimeseriesdata,weneedtoensurethattheconsecutivevaluesintherandomizedtimeseriesaresimilartoeachother,sincetimeseriesdatatypicallyexhibittemporalautocorrelation.Further,ifthetimeseriesdatahaveayearlycycle(e.g.,inclimatedata),wewouldneedtoensurethatsuchcyclicpatternsarealsopreservedinthesyntheticallygeneratedtimeseries.

SpecifictechniquesforgeneratingsyntheticdatasetswillbediscussedinmoredetailinthecontextofassociationanalysisandclusteringinSections10.4 and10.5 ,respectively.

10.2.2 Randomizing Class Labels

Wheneverydatainstancehasanassociatedclasslabel,acommonapproachforgeneratingnewdataistorandomlypermutetheclasslabels,aprocessalsoreferredtoaspermutationtesting.Thisinvolvesrepeatedlyshuffling(permuting)labelsamongdataobjectsatrandomtogenerateanewdatasetthatisidenticaltotheolddatasetexceptforthelabelassignments.Aclassificationmodelisbuiltoneachofthesedatasetsandateststatisticcalculated,e.g.,classificationaccuracy.Theresultingsetofvalues—oneforeachpermutation—canbeusedtoprovideadistributionforthestatisticunderthenullhypothesisthattheattributesinthedatasethavenorelationshipwiththeclasslabels.AswillbedescribedinSection10.3.1 ,thisapproachcanbeusedtotesthowlikelyisittoachievetheclassificationperformanceofalearnedclassifieronatestsetjustbyrandomchance.Althoughpermutingthe

labelsissimple,itcanresultininferiormodelsofthenulldistribution.(SeeBibliographicNotes.)

10.2.3 Resampling Instances

Ideally,wewouldliketohavemultiplesamplesfromtheunderlyingpopulationofdatainstancessothatwecanassessthevalidityandgeneralizabilityofthemodelsandpatternsourdataminingalgorithmsproduce.Onewaytosimulatesuchsamplesistorandomlysampleinstancesfromtheoriginaldatatocreatesyntheticcollectionsofdatainstances—anapproachcalledstatisticalresampling.Forexample,acommonapproachforgeneratingnewdatasetsistousebootstrapsampling,wheredatainstancesarerandomlyselectedwithreplacementsuchthattheresultantdatasetisofthesamesizeastheoriginalset.Forclassification,analternativetobootstrapsamplingisk-foldcross-validation,wherethedatasetissystematicallysplitintosubsetsforknumberoftimes.AswewillseelaterinSection10.3.1 ,suchstatisticalresamplingapproachesareusedtocomputedistributionsofmeasuresofclassificationperformance,suchasaccuracy,precision,andrecall.Resamplingapproachessuchasthebootstrapcanalsobeusedtoestimatethedistributionofthesupportofafrequentitemset.Wecanalsousethesedistributionstoproduceconfidenceintervalsforthesemeasures.

10.2.4 Modeling the Distribution of the Test Statistic

Given multiple samples of data sets generated under the null hypothesis, we can compute the test statistic on every set of samples to obtain the null distribution of the test statistic. This distribution can be used for providing estimates of the probabilities used in statistical testing procedures, such as p-values. One way to achieve this is to fit statistical models, e.g., the normal or the binomial distribution, on the test statistic values from the data sets generated under the null hypothesis. Alternatively, we can also make use of non-parametric approaches for computing p-values, given enough samples. For instance, we can count the fraction of times the test statistic generated under the null distribution exceeds (or takes “more extreme” values than) the test statistic of the observed result, and use this fraction as the p-value of the result.
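The non-parametric route in the last sentence reduces to a one-line computation once the null samples are available. A minimal sketch, in which the null samples are synthetic standard normal draws purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
null_samples = rng.standard_normal(10000)   # test statistic under H0 (illustrative)
observed = 2.3                               # test statistic of the observed result

# Fraction of null samples at least as extreme as the observed value
# (one-sided, "larger is more extreme"); adding 1 avoids a p-value of exactly 0.
p_value = (np.sum(null_samples >= observed) + 1) / (len(null_samples) + 1)
print(p_value)
```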

10.3 Statistical Testing for Classification

There are a number of problems in classification that can be viewed from the perspective of statistical testing and thus can benefit from the techniques described previously in this chapter for avoiding false discoveries. In the following, we discuss some of these problems and the statistical testing approaches that can be used to address them. Note that approaches for comparing whether the performance of two models is significantly different are provided in Section 3.9.2.

10.3.1 Evaluating Classification Performance

Supposethataclassifierappliedtoatestsetshowsanaccuracyofx%.Inordertoassessthevalidityoftheclassifier’sresults,itisimportanttounderstandhowlikelyitistoobtainx%accuracybyrandomchance,i.e.,whenthereisnorelationshipbetweentheattributesinthedatasetandtheclasslabel.Also,ifwechooseacertainthresholdfortheclassificationaccuracytoidentifyeffectiveclassifiers,thenwewouldliketoknowhowmanytimeswecanexpecttofalselyrejectagoodclassifierthatshowsanaccuracylowerthanthethresholdduetothenaturalvariabilityinthedata.

Suchquestionsaboutthevalidityofaclassifier’sperformancecanbeaddressedbyviewingthisproblemfromtheperspectiveofhypothesistesting

asfollows.Considerastatisticaltestingsetupwherewelearnaclassifieronatrainingsetandevaluatethelearnedclassifieronatestset.Thenullhypothesisforthistestisthattheclassifierisnotabletolearnageneralizablerelationshipbetweentheattributesandtheclasslabelsfromthegiventrainingset.Thealternativehypothesisisthattheclassifierindeedlearnsageneralizablerelationshipbetweentheattributesandtheclasslabelsfromthetrainingset.Toevaluatewhetheranobservedresultbelongstothenullhypothesisorthealternativehypothesis,wecanuseameasureoftheclassifier’sperformanceonthetestset,e.g.,precision,recall,oraccuracy,astheteststatistic.

Randomization

Inordertoperformstatisticaltestingusingtheabovesetup,wefirstneedtogeneratenewsampledatasetsunderthenullhypothesisthattherearenonon-randomrelationshipsbetweentheattributesandclasslabels.Thiscanbeachievedbyrandomlypermutingtheclasslabelsofthetrainingdatasuchthatforeverypermutationofthelabels,weproduceanewtrainingsetwheretheattributesandclasslabelsareunrelatedtoeachother.Wecanthenlearnaclassifieroneverysampletrainingsetandapplythelearnedmodelsonthetestsettoobtainanulldistributionoftheteststatistic(classificationperformance).Then,forexample,ifweuseaccuracyasourteststatistic,theobservedvalueofaccuracyforthemodellearnedusingoriginallabelsshouldbesignificantlyhigherthanmostoralloftheaccuraciesgeneratedbymodelslearnedoverrandomlypermutedlabels.However,notethataclassifiermayhaveasignificantp-valuebuthaveanaccuracyonlyslightlybetterthanarandomclassifier,especiallyifthedatasetislarge.Hence,itisimportanttotaketheeffectsizeoftheclassifier(actualvalueofclassificationperformance)intoaccountalongwithinformationaboutitsp-value.
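A sketch of the label-permutation test just described, using scikit-learn; the data set, the classifier (a decision tree), and the number of permutations are illustrative assumptions rather than choices made in the text.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def accuracy_with_labels(train_labels):
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, train_labels)
    return clf.score(X_te, y_te)

observed = accuracy_with_labels(y_tr)

# Null distribution: accuracies of models trained on randomly permuted labels
rng = np.random.default_rng(0)
null_acc = np.array([accuracy_with_labels(rng.permutation(y_tr))
                     for _ in range(200)])

p_value = (np.sum(null_acc >= observed) + 1) / (len(null_acc) + 1)
print(f"observed accuracy={observed:.3f}, permutation p-value={p_value:.4f}")
```

As noted above, a small p-value alone is not enough; the observed accuracy (the effect size) should also be meaningfully higher than the accuracies in the null distribution.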

Bootstrap and Cross-Validation

Anothertypeofanalysisrelevanttopredictivemodels,suchasclassification,istomodelthedistributionofvariousmeasuresofclassificationperformance.Onewaytoestimatesuchdistributionsistogeneratebootstrapsamplesfromthelabeleddata(preservingtheoriginallabels)tocreatenewtrainingandtestsets.Theperformanceofaclassificationmodeltrainedandevaluatedonanumberofthesebootstrappeddatasetscanthenbeusedtogenerateadistributionforthemeasureofinterest.Anotherwaytocreatethealternativedistributionwouldbetousetherandomizedcross-validationprocedure(discussedinSection3.6.2 )wheretheprocessofrandomlypartitioningthelabeleddataintokfoldsisrepeatedmultipletimes.

Suchresamplingapproachescanalsohelpinestimatingconfidenceintervalsformeasuresofthetrueperformanceoftheclassifiertrainedoverallpossibleinstances.Aconfidenceintervalisanintervalofparametervaluesinwhichanestimatedparametervalueisguaranteedtofallacertainpercentageoftimes.Theconfidencelevelisthepercentageoftimestheestimatedparameterwillfallwithintheinterval.Forexample,giventhedistributionofaclassifier’saccuracy,wecanestimatetheintervalofvaluesthatcontains95%ofthedistribution.Thisservesastheconfidenceintervaloftheclassifier’strueaccuracyatthe95%confidencelevel.Toquantifytheinherentuncertaintyintheresult,confidenceintervalsareoftenreportedalongwithpointestimatesofamodel’soutput.
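A minimal sketch of the bootstrap approach for a confidence interval on accuracy; the data, classifier, and number of bootstrap replicates are assumptions for illustration, and the interval is read off the percentiles of the bootstrap distribution.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

rng = np.random.default_rng(1)
scores = []
for _ in range(200):
    # Bootstrap sample of the training set: sample with replacement, same size
    idx = rng.integers(0, len(y_tr), size=len(y_tr))
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    scores.append(clf.score(X_te, y_te))

lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"95% bootstrap confidence interval for accuracy: [{lo:.3f}, {hi:.3f}]")
```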

10.3.2 Binary Classification as Multiple Hypothesis Testing

TheprocessofestimatingthegeneralizationperformanceofabinaryclassifierresemblestheproblemofmultiplehypothesistestingdiscussedpreviouslyinSection10.1.2 .Inparticular,everytestinstancebelongstothenullhypothesis(negativeclass)orthealternativehypothesis(positiveclass).Byapplyingaclassificationmodeloneverytestinstance,weassigneachinstancetothepositiveorthenegativeclass.Theperformanceofaclassificationmodelonasetofresults(resultsofclassifyinginstancesinatestset)canthenbesummarizedbythefamiliarconfusionmatrixpresentedinTable10.1 .

Auniqueaspectofbinaryclassificationthatdifferentiatesitfromconventionalproblemsofmultiplehypothesistestingistheavailabilityofgroundtruthlabelsontestinstances.Hence,insteadofmakinginferencesusingstatisticalassumptions(e.g.,thedistributionoftheteststatisticunderthenullandalternativehypothesis),wecandirectlycomputeerrorestimatesforrejectingthenulloralternativehypothesesusingempiricalmethods,suchasthosepresentedinSection4.11.2 .Table10.2 showsthecorrespondencebetweentheerrormetricsusedinstatisticaltestingandevaluationmeasuresusedinclassificationproblems.

Table 10.2. Correspondence between statistical testing concepts and classifier evaluation measures.

Statistical Testing Concept    Classifier Evaluation Measure    Formula
Type I Error Rate, α           False Positive Rate              FP / (FP + TN)
Type II Error Rate, β          False Negative Rate              FN / (TP + FN)
Power, 1 − β                   Recall                           TP / (TP + FN)

Whiletheseerrormetricscanbereadilycomputedwiththehelpoflabeleddata,thereliabilityofsuchestimatesdependsontheaccuracyoftestlabelswhichmaynotalwaysbeperfect.Insuchcases,itisimportanttoquantifytheuncertaintyintheevaluationmeasuresarisingduetoinaccuraciesinthetestlabels.(SeeBibliographicNotesformoredetails.)Further,whenweapplyalearnedclassificationmodelonunlabeledinstances,wecanusestatisticalmethodsforquantifyingtheuncertaintyintheclassificationoutputs.Forexample,wecanbootstrapthetrainingset(asdiscussedinSection10.3.1 )togeneratemultipleclassificationmodels,andthedistributionoftheiroutputsonanunseeninstancecanbeusedtoestimatetheconfidenceintervaloftheoutputonthatinstance.

Althoughtheabovediscussionwasfocusedonassessingthequalityofaclassifierthatproducesbinaryoutputs,statisticalconsiderationscanalsobeusedtoassessthequalityofaclassifierthatproducesreal-valuedoutputssuchasclassificationscores.TheperformanceofaclassifieracrossarangeofscorethresholdsisgenerallyanalyzedwiththehelpofReceiverOperatingCharacteristiccurves,asdiscussedinSection4.11.4 .ThebasicapproachbehindgeneratinganROCcurveistosortthepredictionsaccordingtotheirscorevaluesandthenplotthetruepositiverateandthefalsepositiverateforeverypossiblevalueofscorethreshold.NotethatthisapproachbearssomeresemblancetotheFDRcontrollingproceduresdescribedinSection10.1.3 ,wherethetopfewrankinginstances(withthelowestp-values)arelabeledaspositivebytheclassifierinordertomaximizetheFDR.However,inthepresenceofgroundtruthlabels,wecanempiricallyestimatemeasuresofclassificationperformancefordifferentscorethresholdswithoutmakinguseofanyexplicitstatisticalmodelsorassumptions.

10.3.3 Multiple Hypothesis Testing in Model Selection

Theproblemofmultiplehypothesistestingplaysamajorroleintheprocessofmodelselection,whereevenifamorecomplexmodelshowsbetterperformancethanasimplermodel,thedifferenceintheirperformancesmaynotbestatisticallysignificant.Specifically,fromastatisticalperspective,amodelwithahighercomplexityoffersalargernumberofpossiblesolutionsthatalearningalgorithmcanchoosefrom,foragivenclassificationproblem.Forexample,havingalargernumberofattributesprovidesalargersetofcandidatesplittingcriteriathatadecisiontreelearningalgorithmcanchoosetobestfitthetrainingdata.However,whenthetrainingsizeissmallandthenumberofcandidatemodelsarelarge,thereisahigherchanceofpickingaspuriousmodel.Moregenerally,thisversionofthemultiplehypothesistestingisknownasselectiveinference.Thisproblemarisesinsituationswherethenumberofpossiblesolutionsforagivenproblem,suchasbuildingapredictivemodel,arenumerous,butthenumberofteststorobustlydeterminetheefficacyofasolutionisquitesmall.SelectiveinferencemayleadtothemodeloverfittingproblemdescribedinSection3.4 .

How does the multiple comparison procedure relate to model overfitting? Many learning algorithms explore a set of independent alternatives, {γi}, and then choose an alternative, γmax, that maximizes a given criterion function. The algorithm will add γmax to the current model in order to improve its training error. This procedure is repeated until no further improvement is observed. As an example, during decision tree growing, multiple tests are performed to determine which attribute can best split the training data. The attribute that leads to the best split is chosen to extend the tree as long as the stopping criterion has not been satisfied.

Let T0 be the initial decision tree and Tx be the new tree after inserting an internal node for attribute x. Consider the following stopping criterion for a decision tree classifier: x is added to the tree if the observed gain, Δ(T0, Tx), is greater than some predefined threshold α. If there is only one attribute test condition to be evaluated, then we can avoid inserting spurious nodes by choosing a large enough value of α. However, in practice, there is more than one test condition available and the decision tree algorithm must choose the best splitting attribute xmax from a set of candidates, {x1, x2, …, xk}. The multiple comparison problem arises because the algorithm applies the following test, Δ(T0, Txmax) > α, instead of Δ(T0, Tx) > α, to decide whether a decision tree should be extended. Just as with the multiple stockbroker example, as the number of alternatives k increases, so does our chance of finding Δ(T0, Txmax) > α. Unless the gain function Δ or threshold α is modified to account for k, the algorithm may inadvertently add spurious nodes with low predictive power to the tree, which leads to the model overfitting problem.

This effect becomes more pronounced when the number of training instances from which xmax is chosen is small, because the variance of Δ(T0, Txmax) is higher when fewer training instances are available. As a result, the probability of finding Δ(T0, Txmax) > α increases when there are very few training instances. This often happens when the decision tree grows deeper, which in turn reduces the number of instances covered by the nodes and increases the likelihood of adding unnecessary nodes into the tree.
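The effect of choosing the best of k candidate splits can be illustrated with a quick simulation. The sketch below assumes a toy setting that is not taken from the text: under the null hypothesis of no real attribute-class relationship, the observed gain of each candidate is modeled as mean-zero noise whose variance shrinks as the number of training instances grows, and we count how often the maximum over k candidates exceeds a fixed threshold α.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.0          # threshold on the observed gain (illustrative)
n_trials = 10000

for n_train in [20, 200]:                 # fewer instances -> higher variance
    for k in [1, 10, 100]:                # number of candidate attributes
        # Assumed null model: each candidate's observed gain is pure noise whose
        # standard deviation decreases with the number of training instances.
        gains = rng.normal(0.0, 10.0 / np.sqrt(n_train), size=(n_trials, k))
        prob_spurious_split = np.mean(gains.max(axis=1) > alpha)
        print(f"n_train={n_train:4d} k={k:4d} "
              f"P(max gain > alpha)={prob_spurious_split:.3f}")
```

The probability of a spurious split grows both with the number of candidates k and with the variance of the gain estimate (i.e., with fewer training instances), which is exactly the overfitting mechanism described above.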

10.4 Statistical Testing for Association Analysis

Since problems in association analysis are usually unsupervised, i.e., we do not have access to ground truth labels to evaluate results, it is important to employ robust statistical testing approaches to ensure that the discovered results are statistically significant and not spurious. For example, in the discovery of frequent itemsets, we often use evaluation measures such as the support of an itemset to measure its interestingness. (The uncertainty in such evaluation measures can be quantified by using resampling methods, e.g., by bootstrapping the transactions and generating a distribution of the support of an itemset from the resulting data sets.) Given a suitable evaluation measure, we also need to specify a threshold on the measure to identify interesting patterns such as frequent itemsets. Although the choice of a relevant threshold is generally guided by domain considerations, it can also be informed with the help of statistical procedures, as we discuss in the following. To simplify this discussion, we assume that our transaction data set is represented as a sparse binary matrix, with 1's representing the presence of items and 0's representing their absence. (See Section 2.1.2.)

Givenatransactiondataset,consideraresulttobethediscoveryofafrequentk-itemsetandtheteststatistictobethesupportoftheitemsetoranyotherevaluationmeasuredescribedinSection5.7 .Thenullhypothesisforthisresultwouldbethatthekitemsinanitemsetareunrelatedtoeachother.Givenacollectionoffrequentitemsets,wecouldthenapplymultiplehypothesistestingmethodssuchastheFWERorFDRcontrollingprocedurestoidentifysignificantpatternswithstronglyassociateditems.However,theitemsetsfoundbyanassociationminingalgorithmoverlapintermsofthe

itemstheycontain.Hence,themultipleresultsinassociationanalysiscannotbeassumedtobeindependentofeachother.Forthisreason,approachessuchastheBonferroniproceduremaybeoverlyconservativeincallingaresultsignificant,whichleadstolowpower.Further,atransactiondatasetmayhavestructureorcharacteristics,e.g.,asubsetoftransactionscontainingalargenumberofitems,whichneedtobeaccountedforwhenapplyingmultiplehypothesistestingprocedures.

Beforewecanapplystatisticaltestingproceduresforproblemsrelatedtoassociationanalysis,wefirstneedtoestimatethedistributionoftheteststatisticofanitemsetunderthenullhypothesisofnoassociationamongtheitems.Thiscanbedonebyeithermakinguseofstatisticalmodelsorbyperformingrandomizationexperiments.Boththesecategoriesofapproachesaredescribedinthefollowing.

10.4.1 Using Statistical Models

Under the null hypothesis that the items are unrelated, we can model the support count of an itemset using statistical models of independence among items. For itemsets containing two independent items, we can use Fisher's exact test. For itemsets containing more than two items, we can use alternative tests of independence such as the chi-squared (χ2) test. Both these approaches are illustrated in the following.

Using Fisher's Exact Test

Consider the problem of modeling the support count of a 2-itemset, {A, B}, under the null hypothesis that A and B occur independently of each other. We are given a data set with N transactions where A and B appear NA and NB times, respectively. Assuming that A and B are independent, the probability of observing A and B together would then be given by

p_{AB} = p_A \times p_B = \frac{N_A}{N} \times \frac{N_B}{N},

where pA and pB are the probabilities of observing A and B individually, which are approximated by their support. The probability of not observing A and B together would then be equal to (1 − pAB). Assuming that the N transactions are independent, we can consider modeling NAB, the number of times A and B appear together, using the binomial distribution (introduced in Section 10.1) as follows:

P(N_{AB} = k) = \binom{N}{k} (p_{AB})^k (1 - p_{AB})^{N-k}.

However, the binomial distribution does not accurately model the support count of {A, B} because it assigns NAB, the support count of {A, B}, a positive probability even when NAB exceeds the individual support counts of A and B. More specifically, the binomial distribution represents the probability of observing an event (co-occurrence of A and B) when sampled with replacement with a fixed probability (pAB). However, in reality, the probability of observing {A, B} decreases if we have already sampled A and B a number of times, because the support counts of A and B are fixed.

Fisher's exact test was designed to handle the above situation where we perform sampling without replacement from a finite population of fixed size. This test is readily explained using the same terminology used in Section 5.7, which dealt with the evaluation of association patterns. For easy reference, we reproduce the contingency table, Table 5.6, which was used in that discussion. See Table 10.3.

Table 10.3. A 2-way contingency table for variables A and B.

        B       B¯      
A       f11     f10     f1+
A¯      f01     f00     f0+
        f+1     f+0     N

We use the notation A¯ (B¯) to indicate that A (B) is absent from a transaction. Each entry fij in this 2 × 2 table denotes a frequency count. For example, f11 is the number of times A and B appear together in the same transaction, while f01 is the number of transactions that contain B but not A. The row sum f1+ represents the support count for A, while the column sum f+1 represents the support count for B.

Note that if N, the number of transactions, and the supports of A (f1+) and B (f+1) are fixed, i.e., held constant, then f0+ and f+0 are fixed. This also implies that specifying the value for one of the entries, f11, f10, f01, or f00, completely specifies the rest of the entries in the table. In that case, Fisher's exact test gives us a simple formula for exactly computing the probability of any specific contingency table. Because of our intended application, we express the formula in terms of the support count, NAB, for the 2-itemset {A, B}. Note that f11 is the observed support count of {A, B}.

P(N_{AB} = f_{11}) = \frac{\binom{f_{1+}}{f_{11}} \binom{f_{0+}}{f_{+1} - f_{11}}}{\binom{N}{f_{+1}}}    (10.4)

Example 10.9 (Fisher's Exact Test). We illustrate the application of Fisher's exact test using the tea-coffee example described at the beginning of Section 5.7.1. We are interested in modeling the null distribution of the support count of {Tea, Coffee}. As described in Section 5.7.1, the co-occurrence of Tea and Coffee can be summarized using the contingency table shown in Table 10.4. We can see that the support count of Coffee is 800 and the support count of Tea is 200, out of a total of 1000 transactions.

Table 10.4. Beverage preferences among a group of 1000 people.

        Coffee    Coffee¯    
Tea     150       50         200
Tea¯    650       150        800
        800       200        1000

To model the null distribution of the support count of {Tea, Coffee}, we simply apply Equation 10.4 from our discussion of Fisher's exact test. This yields the following:

P(N_{AB} = f_{11}) = \frac{\binom{200}{f_{11}} \binom{800}{800 - f_{11}}}{\binom{1000}{800}},

where NAB is the support count of {Tea, Coffee}.

Figure 10.9 shows a plot of the null distribution of the support count for {Tea, Coffee}. We can see that the largest probability for the support count occurs when it is equal to 160. An intuitive explanation for this fact is that when Tea and Coffee are independent, the probability of observing Tea and Coffee together is equal to the product of their individual probabilities, i.e., 0.8 × 0.2 = 0.16. The expected support count of {Tea, Coffee} is thus equal to 0.16 × 1000 = 160. Support counts that are less than 160 indicate negative associations among items.

Figure 10.9. Plot of the probability of the support count given the independence of Tea and Coffee.

Hence, the p-value of a support count of 150 can be calculated by summing the probabilities in the left tail of the null distribution for support counts of 150 and smaller. This yields a p-value of 0.032. This result is not conclusive, since a support count of 150 or less will occur roughly 3 times out of 100, on average, if Tea and Coffee are independent. However, the low p-value tends to indicate that tea and coffee are related, albeit in a negative way, i.e., tea drinkers are less likely to drink coffee than those who don't drink tea. Note that this is just an example and does not necessarily reflect reality. Also note that this finding is consistent with our previous analysis using alternative measures, such as the interest factor (lift measure), as described in Section 5.7.1.
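Fisher's exact test for Example 10.9 corresponds to a hypergeometric null distribution, so the distribution plotted in Figure 10.9 and the left-tail p-value can be reproduced with scipy. A minimal sketch, using the counts from Table 10.4; the printed p-value should be close to the 0.032 reported above.

```python
from scipy.stats import hypergeom

# Population of N = 1000 transactions, 200 of which contain Tea; we "draw" the
# 800 transactions that contain Coffee and count how many also contain Tea,
# i.e., the support count of {Tea, Coffee}.
N, n_tea, n_coffee = 1000, 200, 800
null_dist = hypergeom(M=N, n=n_tea, N=n_coffee)

print("P(support = 160) =", null_dist.pmf(160))          # mode of the null distribution
print("p-value of support <= 150 =", null_dist.cdf(150))  # approximately 0.03
```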

Although the discussion above was centered on the support measure, we can also model the null distribution of any other objective interestingness measure of a 2-itemset introduced in Section 5.7, such as interest, odds ratio, cosine, or all-confidence. This is because all the entries of the contingency table can be uniquely determined by the support measure of the 2-itemset, given the number of transactions and the support counts of the two items. More specifically, the probabilities displayed in Figure 10.9 are the probabilities of specific contingency tables corresponding to a specific value of support for the 2-itemset. For each of these tables, the values of any objective interestingness measure (of two items) can be calculated, and these values define the null distribution of the measure being considered. This approach can also be used to evaluate interestingness measures of association rules such as the confidence of A → B, where A and B are itemsets.

Note that using Fisher's exact test is equivalent to using the hypergeometric distribution.

Using the Chi-Squared Test

The chi-squared (χ2) test provides a generic but approximate approach for measuring the statistical independence among multiple items in an itemset. The basic idea behind the χ2 test is to compute the expected value of every entry in a contingency table, such as the one shown in Table 10.4, assuming that the items are statistically independent. The differences between the observed and expected values in the contingency table can then be used to compute a test statistic that follows the χ2 distribution under the null hypothesis of no association between items.

Formally, consider a two-dimensional contingency table where the entry at the ith row and jth column is denoted by O_{i,j} (i, j ∈ {0, 1}). (We use the notation O_{i,j} instead of f_{ij} since the former is traditionally used to represent the “observed” value in discussions of the χ2 statistic.) If the sum of all entries is equal to N, then we can compute the expected value at every entry as

E_{i,j} = N \times \left( \frac{\sum_i O_{i,j}}{N} \right) \times \left( \frac{\sum_j O_{i,j}}{N} \right).    (10.5)

This follows from the fact that the joint probability of observing independent events is equal to the product of the individual probabilities. When all items are statistically independent, O_{i,j} would usually be close to E_{i,j} for all values of i and j. Hence, the differences between O_{i,j} and E_{i,j} can be used to measure the deviation of the observed contingency table from the null hypothesis of no association. In particular, we can compute the following test statistic:

R = \sum_i \sum_j \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}.    (10.6)

Note that R = 0 only if O_{i,j} and E_{i,j} are equal for every value of i and j. It can be shown that the null distribution of R can be approximated by the χ2 distribution with 1 degree of freedom when N is large. We can thus compute the p-value of an observed value of R using standard implementations of the χ2 distribution.

While the above discussion was centered on the analysis of a two-dimensional contingency table involving two items, the χ2 test can be readily extended to multi-dimensional contingency tables involving more than two items. For example, given a k-itemset X = {i1, i2, …, ik}, we can construct a k-dimensional contingency table with observed entries represented as O_{i_1, i_2, …, i_k} (i1, i2, …, ik ∈ {0, 1}). The expected values of the contingency table and the test statistic R could then be computed as follows:

E_{i_1, i_2, \ldots, i_k} = N \times \prod_{j=1}^{k} \left( \frac{\sum_{i_j} O_{i_1, i_2, \ldots, i_k}}{N} \right).    (10.7)

R = \sum_{i_1} \sum_{i_2} \cdots \sum_{i_k} \frac{(O_{i_1, i_2, \ldots, i_k} - E_{i_1, i_2, \ldots, i_k})^2}{E_{i_1, i_2, \ldots, i_k}}.    (10.8)

Under the null hypothesis that all k items in the itemset X are statistically independent, the distribution of R can again be approximated by a χ2 distribution. However, the general formula for the degrees of freedom is df = (number of rows − 1) × (number of columns − 1). Thus, if we have a 4 by 3 contingency table, then df = (4 − 1) × (3 − 1) = 6.
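A minimal sketch of Equations 10.5 and 10.6 applied to the 2 × 2 tea-coffee table, with the p-value taken from a χ2 distribution with 1 degree of freedom; the use of Table 10.4's counts here is only for illustration.

```python
import numpy as np
from scipy.stats import chi2

# Observed 2x2 contingency table (Table 10.4):
# rows = Tea / no Tea, columns = Coffee / no Coffee.
O = np.array([[150.0, 50.0],
              [650.0, 150.0]])
N = O.sum()

# Expected counts under independence (Equation 10.5)
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / N

# Chi-squared test statistic (Equation 10.6) and its p-value with 1 degree of freedom
R = np.sum((O - E) ** 2 / E)
p_value = chi2.sf(R, df=1)
print(f"R = {R:.2f}, p-value = {p_value:.4f}")
```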

10.4.2 Using Randomization Methods

When it is difficult to model the null distribution of itemsets using statistical models, an alternative approach is to generate synthetic transaction data sets under the null hypothesis of no association among the items, with the same number of items and transactions as the original data. This involves randomly permuting the rows or columns in the original data such that the items in the resultant data are unrelated to each other. As discussed in Section 10.2.1, we must ensure while randomizing the attributes that the resultant data sets are similar to the original data set in all respects except for the desired effect we are interested in evaluating, which is the association among items.

A basic structure we would like to preserve in the synthetic data sets is the support of every item in the original data. In particular, every item should appear in the same number of transactions in the synthetic data sets as in the original data set. One way to preserve this support structure of items is to randomly permute the entries in each column of the original data set independently of the other columns. This ensures that the items have the same support in the synthetically generated data sets but are independent of each other. However, this may violate a different property of the original data that we would like to preserve, which is the length of every transaction (number of items in a transaction). This property can be preserved by randomly shuffling the rows, i.e., the row sums are preserved. However, a drawback of this approach is that the support of every item in the resultant data set may be different than the support of items in the original data set.

A randomization approach that can preserve both the supports and the transaction lengths of the original data is swap randomization. The basic idea is to pick a pair of ones in the original data set from two different rows and columns, say at (row k, column i) and (row l, column j), where k ≠ l and i ≠ j. (See the left table in Figure 10.10.) These two entries define the diagonal of a rectangle of values in the binary transaction matrix. If the entries at opposite corners of the rectangle, i.e., (row k, column j) and (row l, column i), are zeros, then we can swap these zeros with the ones, as shown in Figure 10.10. Note that by performing this swap, both the row sums and column sums are preserved while the association with other items is broken. This process continues until it is likely that the data set is significantly different from the original one. (An appropriate threshold for the number of swaps needs to be determined depending on the size and nature of the original data set.)

Figure 10.10. Illustration of a swap for swap randomization.

Swaprandomizationhasbeenshowntopreservethepropertiesoftransactiondatasetsmoreaccuratelythantheotherapproachesmentioned.However,itisverycomputationallyintensive,particularlyforlargerdatasets,whichcanlimititsapplication.Furthermore,apartfromthesupportofitemsandtransactionlengths,theremaybeothertypesofstructureinthetransactiondatathatswaprandomizationmaynotbeabletopreserve.Forinstance,theremaybesomeknowncorrelationsamongtheitems(duetodomainconsiderations)thatwewouldliketoretaininthesyntheticdatasetswhilebreakingthecorrelationsamongotheritems.Agoodexamplearedatasetsthatrecordthepresenceorabsenceofageneticvariationatvariouslocationsonthegenome(items)acrossmultiplesubjects(transactions).Itemsrepresentinglocationsthatarecloseonthegeneticsequenceareknowntobehighlycorrelated.Thislocalstructureofcorrelationmaybelostinthesyntheticdatasetsifwetreateachcolumnidenticallywhilerandomizing.Whatisneededinthiscaseistokeepthelocalcorrelationbuttobreakcorrelationofareasthatarefurtheraway.
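A minimal sketch of the swap operation on a binary transaction matrix stored as a dense numpy array; the number of attempted swaps is an illustrative choice, and practical implementations usually work on sparse representations for efficiency.

```python
import numpy as np

def swap_randomize(data, n_swaps, seed=0):
    """Return a copy of a 0/1 matrix with row and column sums preserved, obtained
    by repeatedly swapping the corners of 2x2 'checkerboard' rectangles."""
    rng = np.random.default_rng(seed)
    D = data.copy()
    rows, cols = np.nonzero(D)                     # positions of the ones
    for _ in range(n_swaps):
        a, b = rng.integers(0, len(rows), size=2)
        k, i = rows[a], cols[a]                    # first one at (row k, column i)
        l, j = rows[b], cols[b]                    # second one at (row l, column j)
        if k != l and i != j and D[k, j] == 0 and D[l, i] == 0:
            D[k, i] = D[l, j] = 0                  # swap the ones and zeros
            D[k, j] = D[l, i] = 1
            rows[a], cols[a] = k, j                # keep the list of ones up to date
            rows[b], cols[b] = l, i
    return D

data = (np.random.default_rng(1).random((100, 20)) < 0.2).astype(int)
randomized = swap_randomize(data, n_swaps=10000)
assert (randomized.sum(axis=0) == data.sum(axis=0)).all()   # item supports preserved
assert (randomized.sum(axis=1) == data.sum(axis=1)).all()   # transaction lengths preserved
```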

Afterconstructingsyntheticdatasetspertainingtothenullhypothesis,wecangeneratethenulldistributionofthesupportofanitemsetbyobservingitssupportinthesyntheticdatasets.Thisprocedurecanhelpindecidingsupportthresholdsusingstatisticalconsiderationssothatthediscoveredfrequentitemsetsarestatisticallysignificant.

10.5 Statistical Testing for Cluster Analysis

The goodness of a clustering is typically evaluated with the help of cluster validity measures that either capture the cohesion or separation of clusters, such as the sum of squared errors (SSE), or make use of external labels such as entropy. In some cases, the minimum and maximum values of measures have intuitive interpretations that can be used to examine the goodness of a clustering. For instance, if we are given the true class labels of instances and we want our clustering to reflect the class structure, then a purity of 0 is bad, while a purity of 1 is good. Likewise, an entropy of 0 is good, as is an SSE of 0. However, in many cases, we are given intermediate values of cluster validity measures which are difficult to interpret directly without the help of domain considerations.

Statisticaltestingproceduresprovideausefulwayofmeasuringthesignificanceofadiscoveredclustering.Inparticular,wecanconsiderthenullhypothesisthatthereisnoclusterstructureamongtheinstancesandtheclusteringalgorithmisproducingarandompartitioningofthedata.Theapproachistousetheclustervaliditymeasureasateststatistic.Thedistributionofthatteststatisticundertheassumptionthatthedatahasnoclusteringstructureisthenulldistribution.Wecanthentestwhetherthevaliditymeasureactuallyobservedforthedataissignificant.Inthefollowing,weconsidertwogeneralcases:(1)theteststatisticisaninternalclusteringvalidityindexcomputedforunlabeleddata,suchasSSEorthesilhouettecoefficient,or(2)theteststatisticisanexternalindex,i.e.,theclusterlabelsaretobecomparedagainstclasslabels,suchasentropyorpurity.TheseclustervaliditymeasuresaredescribedinSection7.5 .

10.5.1 Generating a Null Distribution for Internal Indices

Internalindicesmeasurethegoodnessofaclusteronlybyreferencetothedataitself—seeSection7.5.2 .Furthermore,oftentheclusteringisdrivenbyanobjectivefunction,andinthosecases,themeasureofaclustering’sgoodnessisprovidedbytheobjectivefunction.Thus,mostofthetime,statisticalevaluationofaclusteringisnotperformed.

Anotherreasonthatsuchanevaluationisnotperformedisthedifficultyingeneratinganulldistribution.Inparticular,togetameaningfulnulldistributionfordeterminingclusterstructure,weneedtocreatedatawithsimilaroverallpropertiesandcharacteristicsasthedatawehaveexceptthatithasnoclusterstructure.Butthiscanbedifficultsincedataoftenhasacomplexstructure,e.g.,thedependenciesamongobservationsintimeseriesdata.Nonetheless,statisticaltestingcanbeusefulifthedifficultiescanbeovercome.Wepresentasimpleexampletoillustratetheapproach.

Example 10.10 (Significance of SSE). This example is based on K-means and the SSE. Suppose that we want a measure of how the well-separated clusters of Figure 10.11a compare with respect to random data. We generate many random (uniformly distributed) sets of 100 points having the same range of values along the two dimensions as the points in the three clusters, find three clusters in each data set using K-means, and accumulate the distribution of SSE values for these clusterings. By using this distribution of the SSE values, we can then estimate the probability of the SSE value for the original clusters. Figure 10.11b shows the histogram of the SSE from 500 random runs. The lowest SSE in the histogram is 0.0173. For the three clusters of Figure 10.11a, the SSE is 0.0050. We could therefore conservatively claim that there is less than a 1% chance that a clustering such as that of Figure 10.11a could occur by chance.

Figure 10.11. Using randomization to evaluate the p-value for a clustering.

Inthepreviousexample,itwasrelativelystraightforwardtouserandomizationtoevaluatethestatisticalsignificanceofaninternalclustervaliditymeasure.Inpractice,domainevaluationisusuallymoreimportant.Forinstance,adocumentclusteringschemecouldbeevaluatedbylookingatthedocumentsandjudgingwhethertheclustersmakesense.Moregenerally,adomainexpertwouldevaluatetheclusterforsuitabilitytoadesiredapplication.Nonetheless,astatisticalevaluationofclusteringissometimesnecessary.AreferencetoanexampleforclimatetimeseriesisprovidedintheBibliographicNotes.
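A sketch of the randomization test in Example 10.10 using scikit-learn's KMeans, whose inertia_ attribute is the SSE. The three well-separated clusters are simulated here since the original data of Figure 10.11a is not available, so the printed numbers will differ from those in the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for the original data: 100 points in three well-separated clusters in [0, 1]^2
centers = np.array([[0.2, 0.2], [0.5, 0.8], [0.8, 0.3]])
X = np.vstack([c + 0.03 * rng.standard_normal((34, 2)) for c in centers])[:100]

observed_sse = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).inertia_

# Null distribution: SSE of 3-means clusterings of uniform random data with the same range
lo, hi = X.min(axis=0), X.max(axis=0)
null_sse = np.array([
    KMeans(n_clusters=3, n_init=10, random_state=i)
    .fit(lo + (hi - lo) * rng.random(X.shape)).inertia_
    for i in range(500)
])

# Smaller SSE is "more extreme" for well-separated clusters
p_value = (np.sum(null_sse <= observed_sse) + 1) / (len(null_sse) + 1)
print(f"observed SSE={observed_sse:.4f}, p-value={p_value:.4f}")
```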

10.5.2 Generating a Null Distribution for External Indices

Ifexternallabelsareusedforevaluation,thenaclusteringisevaluatedusingameasuresuchasentropyortheRandstatistic—seeSection7.5.7 whichassesseshowcloselytheclusterstructure,asreflectedintheclusterlabels,matchestheclasslabels.Someofthesemeasurescanbemodeledwithastatisticaldistribution,e.g.,theadjustedRandindex,whichisbasedonthemultivariatehypergeometricdistribution.Ifameasurehasawell-knowndistribution,thenthisdistributioncanbeusedtocomputeap-value.

However,randomizationcanalsobeusedtogenerateanulldistributioninthiscase,asfollows.

1. Generate M randomized sets of labels, L1, …, Li, …, LM.

2. For each randomized set of labels, compute the value of the external index. Let mi be the value of the external index obtained for the ith randomization. Let m0 be the value of the external index for the original set of labels.

3. Assuming that a larger value of the external index is more desirable, define the p-value of m0 to be the fraction of mi for which mi > m0:

p\text{-value}(m_0) = \frac{|\{m_i : m_i > m_0\}|}{M}    (10.9)

As with the case of unsupervised evaluation of a clustering, domain significance often assumes a dominant role. For example, consider clustering news articles into distinct groups as in Example 7.15, where the articles belong to the classes: Entertainment, Financial, Foreign, Metro, National, and Sports. If we have the same number of clusters as the number of classes of news articles, then an ideal clustering would have two characteristics. First, every cluster would contain only documents from one class, i.e., it would be pure. Second, every cluster would contain all of the documents from a particular class. An actual clustering of documents can be statistically significant, but still be quite poor in terms of purity and/or containing all the documents of a particular document class. Sometimes, such situations are still of interest, as we describe next.
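A minimal sketch of the three-step procedure above, using the adjusted Rand index from scikit-learn as the external index; the class labels, the cluster labels, and M are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
class_labels = np.repeat([0, 1, 2], 50)                            # assumed ground-truth classes
cluster_labels = np.where(rng.random(150) < 0.8,                   # assumed clustering that mostly
                          class_labels, rng.integers(0, 3, 150))   # agrees with the classes

m0 = adjusted_rand_score(class_labels, cluster_labels)   # index for the original labels

M = 1000
m = np.array([adjusted_rand_score(rng.permutation(class_labels), cluster_labels)
              for _ in range(M)])                         # index for randomized label sets

p_value = np.sum(m > m0) / M                              # Equation 10.9
print(f"adjusted Rand index={m0:.3f}, p-value={p_value:.4f}")
```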

10.5.3 Enrichment

Insomecasesinvolvinglabeleddata,thegoalofevaluatingclustersistofindclustersthathavemoreinstancesofaparticularclassthanwouldbeexpectedforarandomclustering.Whenaclusterhasmorethantheexpectednumberofinstancesofaspecificclass,wesaythattheclusterisenrichedinthatclass.Thisapproachiscommonlyusedintheanalysisofbioinformaticsdata,suchasgeneexpressiondata,butisapplicableinmanyotherareasaswell.Furthermore,thisapproachcanbeusedforanycollectionofgroups,notjustthosecreatedbyclustering.Weillustratethisapproachwithasimpleexample.

Example 10.11 (Enrichment of Neighborhoods of a City in Terms of Income Levels). Assume that in a particular city there are 10 distinct neighborhoods, which correspond to clusters in our problem. Overall, there are 10,000 people in the city. Further, assume that there are 3 income levels, Poor (30%), Medium (50%), and Wealthy (20%). Finally, assume that one of the neighborhoods has 1,000 residents, 23% of whom fall into the Wealthy category. The question is whether this neighborhood has more wealthy people than expected by random chance. The contingency table for this example is shown in Table 10.5. We can analyze this table by using Fisher's exact test. (See Example 10.9 in Section 10.4.1.)

Table 10.5. Wealthy residents inside and outside one neighborhood of a city of 10,000 people.

            In Neighborhood    In Neighborhood¯    
Wealthy     230                1,770               2,000
Wealthy¯    770                7,230               8,000
            1,000              9,000               10,000

Using Fisher's exact test, we find that the p-value for this result is 0.0076. This would seem to indicate that more wealthy people live in this neighborhood than would be expected by random chance at a significance level of 1%. However, several points need to be made. First, we may very well be testing every group against every neighborhood to look for enrichment. Thus, there would be 30 tests overall and the p-values should be adjusted for multiple comparisons. For instance, if we use the Bonferroni procedure, 0.0076 would not be a significant result since the significance threshold is now 0.01/30 = 0.0003. Also, the odds ratio for this contingency table is only 1.22. Hence, even if the difference is significant, the actual magnitude of the difference doesn't seem very large, i.e., very far from an odds ratio of 1. In addition, note that multiplying all the entries of the table by 10 will greatly decrease the p-value (≈ 10−9), but the odds ratio will remain the same. Despite these issues, enrichment can be a valuable tool and has yielded useful results for a variety of applications.
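A minimal check of Example 10.11 with scipy's implementation of Fisher's exact test; the one-sided alternative ("greater") asks whether the neighborhood has more wealthy residents than expected, and the output should be close to the p-value of 0.0076 and the odds ratio of 1.22 quoted above.

```python
from scipy.stats import fisher_exact

# Rows: Wealthy / not Wealthy; columns: in neighborhood / not in neighborhood
table = [[230, 1770],
         [770, 7230]]

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio={odds_ratio:.2f}, one-sided p-value={p_value:.4f}")
```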

10.6 Statistical Testing for Anomaly Detection

Anomaly detection algorithms typically produce outputs in the form of class labels (when a classification model is trained over labeled anomalies) or anomaly scores. Statistical considerations can be used to ensure the validity of both these types of outputs, as described in the following.

Supervised Anomaly Detection

If we have access to labeled anomalous instances, the problem of anomaly detection can be converted to a binary classification problem, where the negative class corresponds to the normal data instances, while the positive class corresponds to the anomalous instances. Statistical testing procedures discussed in Section 10.3 for classification are directly relevant for avoiding false discoveries in supervised anomaly detection, albeit with the additional challenges of building a model for imbalanced classes. (See Section 4.11.) In particular, we need to ensure that the classification error metric used during statistical testing is sensitive to the imbalance among the classes and gives enough emphasis to the errors related to the rare anomaly class (false positives and false negatives). After learning a valid classification model, we can also use statistical methods to capture the uncertainty in the outputs of the model on unseen instances. For example, we can use resampling approaches such as the bootstrapping technique to learn multiple classification models from the training set, and the distribution of their labels produced on an unseen instance can be used to estimate confidence intervals of the true class label of the instance.

Unsupervised Anomaly Detection

Most unsupervised anomaly detection approaches produce an anomaly score on data instances to indicate how anomalous an instance is with respect to the normal class. It is then important to decide a suitable threshold on the anomaly score to identify instances that are significantly anomalous and hence are worthy of further investigation. The choice of a threshold is generally specified by the user based on domain considerations on what is acceptable as a significant departure from the normal behavior. Such decisions can also be reinforced with the help of statistical testing methods.

Inparticular,fromastatisticalperspective,wecanconsidereveryinstancetobearesultanditsanomalyscoretobetheteststatistic.Thenullhypothesisisthattheinstancebelongstothenormalclasswhilethealternativehypothesisisthattheinstanceissignificantlydifferentfromotherpointsfromthenormalclassandhenceisananomaly.Hence,giventhenulldistributionoftheanomalyscore,wecancomputethep-valueofeveryresultandusethisinformationtodeterminestatisticallysignificantanomalies.

Aprimerequirementforperformingstatisticaltestingforanomalydetectionistoobtainthedistributionofanomalyscoresforinstancesthatbelongtothenormalclass,asthisisthenulldistribution.Iftheanomalydetectionapproachisbasedonstatisticaltechniques(seeSection9.3 ),wehaveaccesstoastatisticalmodelforestimatingthedistributionofthenormalclass.Inothercases,wecanuserandomizationmethodstogeneratesyntheticdatasetswheretheinstancesonlybelongtothenormalclass.Forexample,ifitispossibletoconstructamodelofthedatawithoutanomalies,thenthismodelcanbeusedtogeneratemultiplesamplesofthedata,andinturn,thosesamplescanbeusedtocreateadistributionoftheanomalyscoresforinstancesthatarenormal.Unfortunately,justasforgeneratingsyntheticdataforclustering,thereisusuallynoeasywaytoconstructrandomdatasetsthat

looksimilartotheoriginaldatainallrespectsexceptthattheycontainonlynormalinstances.
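A minimal sketch of this statistical view of unsupervised anomaly detection, under the simplifying assumptions that the normal class can be modeled by a Gaussian fitted to the data and that the anomaly score is the absolute z-score; real detectors and null models will differ.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 995),      # mostly normal instances
                       rng.normal(8, 1, 5)])       # a few injected anomalies

# Assume the normal class is Gaussian; fit its parameters and use |z| as the anomaly score
mu, sigma = np.mean(data), np.std(data)
scores = np.abs(data - mu) / sigma

# Two-sided p-value of each instance under the fitted normal-class model
p_values = 2 * norm.sf(scores)
anomalies = np.nonzero(p_values < 0.001)[0]
print("instances flagged as anomalies:", anomalies)
```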

Ifanomalydetectionistobeuseful,however,thenatsomepoint,theresultsoftheanomalydetection,particularlythetoprankinganomalies,needtobeevaluatedbydomainexpertstoassesstheperformanceofthealgorithm.Iftheanomaliesproducedbythealgorithmdonotagreewiththeexpertassessment,thisdoesnotnecessarilymeanthatthealgorithmisnotperformingwell.Instead,itmayjustmeanthatthedefinitionofananomalybeingusedbytheexpertandthealgorithmdiffer.Forinstance,theexpertmayviewcertainaspectsofthedataasirrelevant,butthealgorithmmaybetreatingthemasimportant.Insuchcases,theseaspectsofthedatacanbedeemphasizedtohelprefinethestatisticaltestingprocedures.Alternatively,theremaybenewtypesofanomaliesthattheexpertisunfamiliarwith,sinceanomaliesare,bytheirverynature,supposedtobesurprising.

Base Rate Fallacy

Consider an anomaly detection system that can accurately detect 99.9% of the fraudulent credit card transactions with a false alarm rate of only 0.01%. If a transaction is flagged as an anomaly by the system, how likely is it to be genuinely fraudulent? A common misconception is that the majority of the detected anomalies are fraudulent transactions given the high detection rate and low false alarm rate of the system. However, this can be misleading if the skew of the data is not taken into consideration. This problem is also known as the base rate fallacy or base rate neglect.

To illustrate the problem, consider the contingency table shown in Table 10.6. Let d be the detection rate (i.e., true positive rate) of the system and f be its false alarm rate, or to be more specific,

P(Alarm | Fraud) = d  and  P(Alarm | Not Fraud) = f.

Table 10.6. Contingency table for an anomaly detection system with detection rate d and false alarm rate f.

            Alarm                   No Alarm
Fraud       dαN                     (1 − d)αN                     αN
No Fraud    f(1 − α)N               (1 − f)(1 − α)N               (1 − α)N
            dαN + f(1 − α)N         (1 − d)αN + (1 − f)(1 − α)N   N

Our goal is to calculate the precision of the system, i.e., P(Fraud | Alarm). If the precision is high, then the majority of the alarms are indeed triggered by fraudulent transactions. Based on the information given in Table 10.6, the precision of the system can be calculated as follows:

Precision = \frac{d\alpha N}{d\alpha N + f(1 - \alpha)N} = \frac{d\alpha}{f + (d - f)\alpha},    (10.10)

where α is the percentage of fraudulent transactions in the data. Since d = 0.999 and f = 0.0001, the precision of the system is

Precision = \frac{0.999\alpha}{0.0001 + 0.9989\alpha}.    (10.11)

If the data is not skewed, e.g., when α = 0.5, then its precision would be very high, 0.9999, so we can trust that the majority of the flagged transactions are fraudulent. However, if the data is highly skewed, e.g., when α = 2 × 10−5 (one in fifty thousand transactions), then the precision is only 0.167, which means that only about one in six alarms is a true anomaly.
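Equation 10.10 is easy to evaluate directly, which makes the effect of the base rate α on precision concrete; the values of d and f below are the ones used in the example, while the list of α values tried is an illustrative choice.

```python
d, f = 0.999, 0.0001     # detection rate and false alarm rate from the example

def precision(alpha):
    """P(Fraud | Alarm) as a function of the fraud base rate alpha (Equation 10.10)."""
    return d * alpha / (f + (d - f) * alpha)

for alpha in [0.5, 1e-2, 1e-3, 2e-5]:
    print(f"alpha={alpha:g}  precision={precision(alpha):.4f}")
```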

Theprecedingexampleillustratestheimportanceofconsideringskewnessofthedatawhenchoosinganappropriateanomalydetectionsystemforagivenapplication.Iftheeventofinterestoccursrarely,say,oneinfiftythousandofthepopulation,thenevenasystemwith99.9%detectionrateand0.01%falsealarmratecanstillmake5mistakesforevery6anomaliesflaggedbythesystem.Theprecisionofthesystemdegradessignificantlyasthepercentageofskewnessinthedataincreases.Thecruxofthisproblemliesinthefactthatdetectionrateandfalsealarmratearemetricsthatarenotsensitivetoskewnessintheclassdistribution,aproblemthatwasfirstalludedtoinSection4.11 duringourdiscussionontheclassimbalancedproblem.Thelessonhereisthatanyevaluationofananomalydetectionsystemmusttakeintoaccountthedegreeofskewnessinthedatabeforedeployingthesystemintopractice.

10.7 Bibliographic Notes

Recently, there has been a growing body of literature that is concerned with the validity and reproducibility of research results. Perhaps the most well-known work in that area is the paper by Ioannidis [721], which asserts that most published research findings are false. There have been various critiques of this work, e.g., see Goodman and Greenland [717] and Ioannidis' rebuttal [716, 722]. Regardless, concern about the validity and reproducibility of results has only continued to expand. A paper by Simmons et al. [742] states that almost any effect in psychology can be presented as statistically significant given current practice. The paper also suggests recommended changes in research practice and article review. A Nature survey by Baker [697] reported that more than 70% of researchers have tried and failed to replicate other researchers' results, and 50% have failed to replicate their own results. On a more positive note, Jager and Leek [724] looked at published medical research and although they identified a need for improvements, concluded that “our analysis suggests that the medical literature remains a reliable record of scientific progress.” The recent book by Nate Silver [741] has discussed a number of predictive failures in various areas, including baseball, politics, and economics. Although numerous other studies and references can be cited in a number of areas, the key point is that there is a widespread perception, backed by a fair amount of evidence, that many current data analyses are not trustworthy and that there are various steps that can be taken to improve the situation [699, 723, 729]. Although this chapter has focused on statistical issues, many of the changes advocated, e.g., by Ioannidis in his original paper, are not statistical in nature.

ThenotionofsignificancetestingwasintroducedbytheprominentstatisticianRonaldFisher[710,734].Inresponsetoperceivedshortcomings,Neyman

andPearson[735,736]introducedhypothesistesting.Thetwoapproacheshaveoftenbeenmergedinanapproachknownasnullhypothesisstatisticaltesting(NHST)[731],whichhasbeenthesourceofmanyproblems[712,720].Anumberofp-valuemisconceptionsaresummarizedinvariousrecentpapers,forexample,thosebyGoodman[715],Nuzzo[737],andGelman[711].TheAmericanStatisticalAssociationhasrecentlyissuedastatementonp-values[751].PapersthatdescribetheBayesianapproach,asexemplifiedbytheBayesfactorandpriorodds,areKassandRaftery[727]andGoodmanandSander[716].ArecentpaperbyBenjaminandlargenumberofotherprominentstatisticians[699],usessuchanapproachtoarguethat0.005,insteadof0.05,shouldbethedefaultp-valuethresholdforstatisticalsignificance.Moregenerally,themisinterpretationandmisuseofp-valuesisnottheonlyproblemassomehavenoted[730].NotethatbothFisher’ssignificancetestingandtheNeyman-Pearsonhypothesistestingapproachesweredesignedwithstatisticallydesignedexperimentsinmind,butareoften,perhapsmostly,appliedtoobservationaldata.Indeed,mostdatabeinganalyzednowadaysisobservationaldata.

TheseminalpaperforthefalsediscoveryrateisbyBenjaminiandHochberg[701].ThepositivefalsediscoveryratewasproposedbyStorey[743–745].Efronhasadvocatedtheuseofthelocalfalsediscoveryrate[704–707].TheworkofEfron,Storey,Tibshirani,andothershasbeenappliedinasoftwarepackageforanalyzingmicroarrydata:SAM:SignificanceAnalysisofMicroarrays[707,746,750].Moregenerally,mostmathematicalandstatisticalsoftwarehaspackagesforcomputingFDR.Inparticular,seetheFDRtoolinRbyStrimmer[748,749]ortheq-valueroutine[698,747],whichisavailableinBioconductor,awell-knownRpackage.Arecentsurveyofpastandcurrentworkinmultiplehypothesistesting(multiplecomparison)isgivenbyBenjamini[700].

AsdiscussedinSection10.2 ,resamplingapproaches,especiallythosebasedontherandomization/permutationandthebootstrap/cross-validation,arethemainapproachtomodelingthenulldistributionorthedistributionsofevaluationmetrics,andthus,computingevaluationmeasuresofinterest,suchasp-values,falsediscoveryrates,andconfidenceintervals.Discussionandreferencestothebootstrapandcross-validationareprovidedintheBibliographicNotesofChapter3 .Generalresourcesonpermutation/randomizationincludebooksbyEdgingtonandOnghena[703],Good[714],andPesarinandLuigi[740],aswellasthearticlesbyCollingridge[702],Ernst[709]andWelch[756].Althoughsuchtechniquesarewidelyused,therearelimitations,suchasthosediscussedinsomedetailbyEfron[705].Inthispaper,EfrondescribesaBayesianapproachforestimatinganempiricalnulldistributionandusingittocomputea“local”falsediscoveryratethatismoreaccuratethanapproachesusinganulldistributionbasedonrandomizationortheoreticalapproaches.

Aswehaveseenintheapplicationspecificsections,differentareasofdataanalysistendtouseapproachesspecifictotheirproblem.Thepermutation(randomization)ofclasslabelsdescribedinSection10.3.1 isastraightforwardandwell-knowntechniqueinclassification.ThepaperbyOjalaandGarigga[738]examinesthisapproachinmoredepthandpresentsanalternativerandomizationapproachthatcanhelpidentify,foragivendataset,whetherdependencyamongfeaturesisimportantintheclassificationperformance.ThepaperbyJensenandCohen[726]isarelevantreferenceforthediscussionofmultiplehypothesistestinginmodelselection.Clusteringhasrelativelylittleworkintermsofstatisticalvalidationsincemostusersrelyonmeasuresofclusteringgoodnesstoevaluateoutcomes.However,someusefulresourcesareChapter4 ofJainandDubes’clusteringbook[725]andtherecentsurveyofclusteringvaliditymeasuresbyXiongandLi.[757].TheswaprandomizationapproachwasintroducedintoassociationanalysisbyGionisetal.[713].Thispaperhasanumberofreferencesthattracethe

originofthisapproachinotherareas,aswellasreferencestootherpapersfortheassessmentofassociationpatterns.Thisworkwasextendedtoreal-valuedmatricesbyOjalaetal.[739].AnotherimportantresourceforstatisticallysoundassociationpatterndiscoveryistheworkofWebb[752–755].HӓmӓlӓinenandWebbtaughtatutorialinKDD2014,StatisticallySoundPatternDiscovery.RelevantpublicationsbyHӓmӓlӓineninclude[719]and[718].

Thedesignofexperimentstoreducevariabilityandincreasepowerisacorecomponentofstatistics.Thereareanumberofgeneralbooksonthetopic,e.g.,theonebyMontgomery[732],butmanymorespecializedtreatmentsofthetopicareavailableforvariousdomains.Inrecentyears,A/Btestinghasemergedasacommontoolofcompaniesforcomparingtwoalternatives,e.g.,twowebpages.ArecentpaperbyKohavietal.[728]providesasurveyandpracticalguidetoA/Btestingandsomeofitsvariants.

Much of the material presented in Sections 10.1 and 10.2 is covered in various statistics books and articles, many of which were mentioned previously. Additional reference material for significance and hypothesis testing can be found in introductory texts, although, as mentioned above, these two approaches are not always clearly distinguished. The use of hypothesis testing is widespread in a number of domains, e.g., medicine, since the approach allows investigators to determine how many samples will be needed to achieve certain target values for the Type I error, power, and effect size. See, for example, Ellis [708] and Murphy et al. [733].

Bibliography
[697] M. Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533(7604):452–454, 2016.
[698] D. Bass, A. Dabney, and D. Robinson. qvalue: Q-value estimation for false discovery rate control. R package, 2012.
[699] D. J. Benjamin, J. Berger, M. Johannesson, B. A. Nosek, E.-J. Wagenmakers, R. Berk, K. Bollen, B. Brembs, L. Brown, C. Camerer, et al. Redefine statistical significance. PsyArXiv, 2017.
[700] Y. Benjamini. Simultaneous and selective inference: current successes and future challenges. Biometrical Journal, 52(6):708–721, 2010.
[701] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), pages 289–300, 1995.
[702] D. S. Collingridge. A primer on quantitized data analysis and permutation testing. Journal of Mixed Methods Research, 7(1):81–97, 2013.
[703] E. Edgington and P. Onghena. Randomization tests. CRC Press, 2007.
[704] B. Efron. Local false discovery rates. Division of Biostatistics, Stanford University, 2005.
[705] B. Efron. Large-scale simultaneous hypothesis testing. Journal of the American Statistical Association, 2012.
[706] B. Efron et al. Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):1–22, 2008.
[707] B. Efron, R. Tibshirani, J. D. Storey, and V. Tusher. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96(456):1151–1160, 2001.
[708] P. D. Ellis. The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge University Press, 2010.
[709] M. D. Ernst et al. Permutation methods: a basis for exact inference. Statistical Science, 19(4):676–685, 2004.
[710] R. A. Fisher. Statistical methods for research workers. In Breakthroughs in Statistics, pages 66–70. Springer, 1992 (originally 1925).
[711] A. Gelman. Commentary: P values and statistical practice. Epidemiology, 24(1):69–72, 2013.
[712] G. Gigerenzer. Mindless statistics. The Journal of Socio-Economics, 33(5):587–606, 2004.
[713] A. Gionis, H. Mannila, T. Mielikäinen, and P. Tsaparas. Assessing data mining results via swap randomization. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(3):14, 2007.
[714] P. Good. Permutation tests: a practical guide to resampling methods for testing hypotheses. Springer Science & Business Media, 2013.
[715] S. Goodman. A dirty dozen: twelve p-value misconceptions. In Seminars in Hematology, volume 45(3), pages 135–140. Elsevier, 2008.
[716] S. Goodman and S. Greenland. Assessing the Unreliability of the Medical Literature: A Response to Why Most Published Research Findings are False. bepress, 2007.
[717] S. Goodman and S. Greenland. Why most published research findings are false: problems in the analysis. PLoS Med, 4(4):e168, 2007.
[718] W. Hämäläinen. Efficient search for statistically significant dependency rules in binary data. PhD Thesis, Department of Computer Science, University of Helsinki, 2010.
[719] W. Hämäläinen. Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures. Knowledge and Information Systems, 32(2):383–414, 2012.
[720] R. Hubbard. Alphabet Soup: Blurring the Distinctions Between p's and α's in Psychological Research. Theory & Psychology, 14(3):295–327, 2004.
[721] J. P. Ioannidis. Why most published research findings are false. PLoS Med, 2(8):e124, 2005.
[722] J. P. Ioannidis. Why most published research findings are false: author's reply to Goodman and Greenland. PLoS Medicine, 4(6):e215, 2007.
[723] J. P. Ioannidis. How to make more published research true. PLoS Medicine, 11(10):e1001747, 2014.
[724] L. R. Jager and J. T. Leek. An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics, 15(1):1–12, 2013.
[725] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall Advanced Reference Series. Prentice Hall, March 1988.
[726] D. Jensen and P. R. Cohen. Multiple Comparisons in Induction Algorithms. Machine Learning, 38(3):309–338, March 2000.
[727] R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.
[728] R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu, and N. Pohlmann. Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1168–1176. ACM, 2013.
[729] D. Lakens, F. G. Adolfi, C. Albers, F. Anvari, M. A. Apps, S. E. Argamon, M. A. van Assen, T. Baguley, R. Becker, S. D. Benning, et al. Justify Your Alpha: A Response to Redefine Statistical Significance. PsyArXiv, 2017.
[730] J. T. Leek and R. D. Peng. Statistics: P values are just the tip of the iceberg. Nature, 520(7549):612, 2015.
[731] E. F. Lindquist. Statistical analysis in educational research. Houghton Mifflin, 1940.
[732] D. C. Montgomery. Design and analysis of experiments. John Wiley & Sons, 2017.
[733] K. R. Murphy, B. Myors, and A. Wolach. Statistical power analysis: A simple and general model for traditional and modern hypothesis tests. Routledge, 2014.
[734] J. Neyman. R. A. Fisher (1890–1962): An Appreciation. Science, 156(3781):1456–1460, 1967.
[735] J. Neyman and E. S. Pearson. On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, pages 175–240, 1928.
[736] J. Neyman and E. S. Pearson. On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika, pages 263–294, 1928.
[737] R. Nuzzo. Scientific method: Statistical errors. Nature News, Feb. 12, 2014.
[738] M. Ojala and G. C. Garriga. Permutation tests for studying classifier performance. Journal of Machine Learning Research, 11(Jun):1833–1863, 2010.
[739] M. Ojala, N. Vuokko, A. Kallio, N. Haiminen, and H. Mannila. Randomization of real-valued matrices for assessing the significance of data mining results. In Proceedings of the 2008 SIAM International Conference on Data Mining, pages 494–505. SIAM, 2008.
[740] F. Pesarin and L. Salmaso. Permutation tests for complex data: theory, applications and software. John Wiley & Sons, 2010.
[741] N. Silver. The signal and the noise: Why so many predictions fail - but some don't. Penguin, 2012.
[742] J. P. Simmons, L. D. Nelson, and U. Simonsohn. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, page 0956797611417632, 2011.
[743] J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):479–498, 2002.
[744] J. D. Storey. The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics, pages 2013–2035, 2003.
[745] J. D. Storey, J. E. Taylor, and D. Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):187–205, 2004.
[746] J. D. Storey and R. Tibshirani. SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In The Analysis of Gene Expression Data, pages 272–290. Springer, 2003.
[747] J. D. Storey, W. Xiao, J. T. Leek, R. G. Tompkins, and R. W. Davis. Significance analysis of time course microarray experiments. Proceedings of the National Academy of Sciences of the United States of America, 102(36):12837–12842, 2005.
[748] K. Strimmer. fdrtool: a versatile R package for estimating local and tail area-based false discovery rates. Bioinformatics, 24(12):1461–1462, 2008.
[749] K. Strimmer. A unified approach to false discovery rate estimation. BMC Bioinformatics, 9(1):303, 2008.
[750] V. G. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98(9):5116–5121, 2001.
[751] R. L. Wasserstein and N. A. Lazar. The ASA's statement on p-values: context, process, and purpose. The American Statistician, 2016.
[752] G. I. Webb. Discovering significant patterns. Machine Learning, 68(1):1–33, 2007.
[753] G. I. Webb. Layered critical values: a powerful direct-adjustment approach to discovering significant patterns. Machine Learning, 71(2):307–323, 2008.
[754] G. I. Webb. Self-sufficient itemsets: An approach to screening potentially interesting associations between items. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1):3, 2010.
[755] G. I. Webb and J. Vreeken. Efficient discovery of the most interesting associations. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):15, 2014.
[756] W. J. Welch. Construction of permutation tests. Journal of the American Statistical Association, 85(411):693–698, 1990.
[757] H. Xiong and Z. Li. Clustering Validation Measures. In C. C. Aggarwal and C. K. Reddy, editors, Data Clustering: Algorithms and Applications, pages 571–605. Chapman & Hall/CRC, 2013.

10.8 Exercises

1. Statistical testing proceeds in a manner analogous to the mathematical proof technique, proof by contradiction, which proves a statement by assuming it is false and then deriving a contradiction. Compare and contrast statistical testing and proof by contradiction.

2. Which of the following are suitable null hypotheses? If not, explain why.

a. Comparing two groups: Consider comparing the average blood pressure of a group of subjects, both before and after they are placed on a low salt diet. In this case, the null hypothesis is that a low salt diet does not reduce blood pressure, i.e., that the average blood pressure of the subjects is the same before and after the change in diet.

b. Classification: Assume there are two classes, labeled + and −, where we are most interested in the positive class, e.g., the presence of a disease. The null hypothesis, H0, is the statement that the class of an object is negative, i.e., that the patient does not have the disease.

c. Association Analysis: For frequent patterns, the null hypothesis is that the items are independent and thus, any pattern that we detect is spurious.

d. Clustering: The null hypothesis is that there is cluster structure in the data beyond what might occur at random.

e. Anomaly Detection: Our assumption, H0, is that an object is not anomalous.

3. Consider once again the coffee-tea example, presented in Example 10.9. The following two tables are the same as the one presented in Example 10.9 except that each entry has been divided by 10 (left table) or multiplied by 10 (right table).

Table 10.7. Beverage preferences among a group of 100 people (left) and 10,000 people (right).

100 people:
           Coffee   No Coffee   Total
Tea            15           5      20
No Tea         65          15      80
Total          80          20     100

10,000 people:
           Coffee   No Coffee   Total
Tea          1500         500    2000
No Tea       6500        1500    8000
Total        8000        2000   10000

a. Compute the p-value of the observed support count for each table, i.e., for 15 and 1500. What pattern do you observe as the sample size increases?

b. Compute the odds ratio and interest factor for the two contingency tables presented in this problem and the original table of Example 10.9. (See Section 5.7.1 for definitions of these two measures.) What pattern do you observe?

c. The odds ratio and interest factor are measures of effect size. Are these two effect sizes significant from a practical point of view?

d. What would you conclude about the relationship between p-values and effect size for this situation?
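For part (b), the odds ratio and interest factor of a 2 × 2 contingency table have simple closed forms, so a small helper like the following (a sketch in Python; the function and variable names are ours) can be used to check the pattern across the three tables:

```python
def odds_ratio_and_interest(f11, f10, f01, f00):
    """f11 = tea and coffee, f10 = tea without coffee, f01 = coffee without tea,
    f00 = neither. Returns (odds ratio, interest factor) for the 2x2 table."""
    n = f11 + f10 + f01 + f00
    odds_ratio = (f11 * f00) / (f10 * f01)
    interest = n * f11 / ((f11 + f10) * (f11 + f01))
    return odds_ratio, interest

print(odds_ratio_and_interest(15, 5, 65, 15))          # left table (100 people)
print(odds_ratio_and_interest(1500, 500, 6500, 1500))  # right table (10,000 people)
```

Because both measures are ratios of counts, scaling every cell by the same factor leaves them unchanged, which is the pattern the exercise asks about.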

4. Consider the different combinations of effect size and p-value applied to an experiment where we want to determine the efficacy of a new drug.

(i) effect size small, p-value small
(ii) effect size small, p-value large
(iii) effect size large, p-value small
(iv) effect size large, p-value large

Whether effect size is small or large depends on the domain, which in this case is medical. For this problem, consider a small p-value to be less than 0.001, while a large p-value is above 0.05. Assume that the sample size is relatively large, e.g., thousands of patients with the condition that the drug hopes to treat.

a. Which combination(s) would very likely be of interest?

b. Which combination(s) would very likely not be of interest?

c. If the sample size were small, would that change your answers?

5. For Neyman-Pearson hypothesis testing, we need to balance the tradeoff between α, the probability of a Type I error, and power, i.e., 1 − β, where β is the probability of a Type II error. Compute α, β, and the power for the cases given below, where we specify the null and alternative distributions and the accompanying critical region. All distributions are Gaussian with some specified mean μ and standard deviation σ, i.e., N(μ, σ). Let T be the test statistic.

a. H0: N(0, 1), H1: N(3, 1), critical region: T > 2.
b. H0: N(0, 1), H1: N(3, 1), critical region: |T| > 2.
c. H0: N(−1, 1), H1: N(3, 1), critical region: T > 1.
d. H0: N(−1, 1), H1: N(3, 1), critical region: |T| > 1.
e. H0: N(−1, 0.5), H1: N(3, 0.5), critical region: T > 1.
f. H0: N(−1, 0.5), H1: N(3, 0.5), critical region: |T| > 1.
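For checking answers to this exercise, α and β are simply tail probabilities of the two Gaussians over the critical region, which can be computed with scipy.stats; this is a sketch, and the helper name and argument order are ours.

```python
from scipy.stats import norm

def alpha_beta_power(mu0, mu1, sigma, c, two_sided=False):
    """alpha = P(T in critical region | H0), beta = P(T outside critical region | H1).
    The critical region is T > c, or |T| > c when two_sided is True."""
    if two_sided:
        alpha = norm.sf(c, mu0, sigma) + norm.cdf(-c, mu0, sigma)
        beta = norm.cdf(c, mu1, sigma) - norm.cdf(-c, mu1, sigma)
    else:
        alpha = norm.sf(c, mu0, sigma)
        beta = norm.cdf(c, mu1, sigma)
    return alpha, beta, 1 - beta

print(alpha_beta_power(0, 3, 1, 2))                  # case (a): alpha ~ 0.023, power ~ 0.84
print(alpha_beta_power(0, 3, 1, 2, two_sided=True))  # case (b)
```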

6. A p-value measures the probability of the result given that the null hypothesis is true. However, many people who calculate p-values have used it as the probability of the null hypothesis given the result, which is erroneous. A Bayesian approach to this problem is summarized by Equation 10.12:

P(H1 | x_obs) / P(H0 | x_obs) = [f(x_obs | H1) / f(x_obs | H0)] × [P(H1) / P(H0)],   (10.12)

where the left-hand side is the posterior odds of H1.

This approach computes the ratio of the probability of the alternative and null hypotheses (H1 and H0, respectively) given the observed outcome, x_obs. In turn, this quantity is expressed as the product of two factors: the Bayes factor and the prior odds. The prior odds is the ratio of the probability of H1 to the probability of H0 based on prior information about how likely we believe each hypothesis is. Usually, the prior odds is estimated directly based on experience. For example, in drug testing in the laboratory, it may be known that most drugs do not produce potentially therapeutic effects. The Bayes factor is the ratio of the probability or probability density of the observed outcome, x_obs, under H1 and H0. This quantity is computed and represents a measure of how much more or less likely the observed result is under the alternative hypothesis than the null hypothesis. Conceptually, the higher it is, the more we would tend to prefer the alternative to the null. The higher the Bayes factor, the stronger the evidence provided by the data for H1. More generally, this approach can be applied to assess the evidence for any hypothesis versus another. Thus, the roles of H1 and H0 can be (and often are) reversed in Equation 10.12.

a. Suppose that the Bayes factor is 20, which is very strong, but the prior odds are 0.01. Would you be inclined to prefer the alternative or null hypothesis?

b. Suppose the prior odds are 0.25, the null distribution is Gaussian with density given by f0(x) = N(0, 2), and the alternative distribution is given by f1(x) = N(3, 1). Compute the Bayes factor and posterior odds of H1 for the following values of x_obs: 2, 2.5, 3, 3.5, 4, 4.5, 5. Explain the pattern that you see in both quantities.
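Part (b) only requires evaluating the two Gaussian densities at each observed value; the sketch below does this in Python, assuming (as in Exercise 5) that N(μ, σ) is parameterized by the standard deviation σ.

```python
from scipy.stats import norm

def bayes_factor_and_posterior_odds(x_obs, prior_odds=0.25):
    """Bayes factor f1(x_obs)/f0(x_obs) and posterior odds of H1 per Equation 10.12,
    with f0 = N(0, 2) and f1 = N(3, 1) from part (b)."""
    bf = norm.pdf(x_obs, loc=3, scale=1) / norm.pdf(x_obs, loc=0, scale=2)
    return bf, bf * prior_odds

for x in (2, 2.5, 3, 3.5, 4, 4.5, 5):
    bf, post = bayes_factor_and_posterior_odds(x)
    print(f"x_obs = {x}: Bayes factor = {bf:.2f}, posterior odds = {post:.2f}")
```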

7. Consider the problem of determining whether a coin is a fair one, i.e., P(heads) = P(tails) = 0.5, by flipping the coin 10 times. Use the binomial theorem and basic probability to answer the following questions.

a. A coin is flipped ten times and it comes up heads every time. What is the probability of getting 10 heads in a row, and what would you conclude about whether the coin is fair?

b. Suppose 10,000 coins are each flipped 10 times in a row and the flips of 10 coins result in all heads. Can you confidently say that these coins are not fair?

c. What can you conclude about results when evaluated individually versus in a group?

d. Suppose that you flip each coin 20 times and then evaluate 10,000 coins. Can you now confidently say that any coin which yields all heads is not fair?
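The probabilities in this exercise follow directly from the binomial distribution; the short check below is a sketch in Python, and the printed interpretations are ours.

```python
from scipy.stats import binom

p10 = binom.pmf(10, 10, 0.5)   # P(10 heads in 10 flips of a fair coin) = 0.5**10
p20 = binom.pmf(20, 20, 0.5)   # P(20 heads in 20 flips of a fair coin) = 0.5**20

print(p10, 10000 * p10)  # ~0.00098; about 10 fair coins out of 10,000 expected to show all heads
print(p20, 10000 * p20)  # ~9.5e-7; far less than one all-heads coin expected among 10,000
```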

8. Algorithm 10.1 on page 773 provides a method for calculating the false discovery rate using the procedure advocated by Benjamini and Hochberg. The description in the text is presented in terms of ordering the p-values and adjusting the significance level to assess whether a p-value is significant. Another way to interpret this method is in terms of ordering the p-values, smallest to largest, and computing adjusted p-values, p'_i = p_i × m / i, where i identifies the ith smallest p-value and m is the number of p-values. Statistical significance is determined based on whether p'_i ≤ α, where α is the desired false discovery rate.

a. Compute the adjusted p-values for the p-values in Table 10.8. Note that the adjusted p-values may not be monotonic. In that case, an adjusted p-value that is larger than its successor is changed to have the same value as its successor.

Table 10.8. Ordered collection of p-values.

i:                  1      2      3      4      5      6      7      8      9      10
original p-value:   0.001  0.005  0.05   0.065  0.15   0.21   0.25   0.3    0.45   0.5

b. If the desired FDR is 20%, i.e., α = 0.20, then for which p-values is H0 rejected?

c. Suppose that we use the Bonferroni procedure instead. For different values of α, namely 0.01, 0.05, and 0.10, compute the modified p-value threshold, α* = α/10, that the Bonferroni procedure will use to evaluate p-values. Then determine, for each value of α*, for which p-values H0 will be rejected. (If a p-value equals the threshold, it is rejected.)
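A sketch of the adjusted p-value computation described above (with the monotonicity fix from part (a)), applied to Table 10.8; the function name is ours.

```python
import numpy as np

def bh_adjusted(p_values):
    """Adjusted p-values p'_i = p_i * m / i for the sorted p-values, then any value
    larger than its successor is lowered to match it (monotonicity fix)."""
    p = np.sort(np.asarray(p_values, dtype=float))
    m = len(p)
    adj = p * m / np.arange(1, m + 1)
    return np.minimum.accumulate(adj[::-1])[::-1]

p_vals = [0.001, 0.005, 0.05, 0.065, 0.15, 0.21, 0.25, 0.3, 0.45, 0.5]
adj = bh_adjusted(p_vals)
print(adj)                        # part (a)
print(adj <= 0.20)                # part (b): reject H0 where the adjusted p-value <= alpha
for alpha in (0.01, 0.05, 0.10):  # part (c): Bonferroni threshold alpha* = alpha / 10
    print(alpha / 10, np.asarray(p_vals) <= alpha / 10)
```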

9. The positive false discovery rate (pFDR) is similar to the false discovery rate defined in Section 10.1.3 but assumes that the number of true positives is greater than 0. Calculation of the pFDR is similar to that of FDR, but requires an assumption on the value of m0, the number of results that satisfy the null hypothesis. The pFDR is less conservative than FDR, but more complicated to compute.

The positive false discovery rate also allows the definition of an FDR analogue of the p-value. The q-value is the expected fraction of hypotheses that will be false if the given hypothesis is accepted. Specifically, the q-value associated with a p-value is the expected proportion of false positives among all hypotheses that are more extreme, i.e., have a lower p-value. Thus, the q-value associated with a p-value is the positive false discovery rate that would result if the p-value was used as the threshold for rejection.

Below we show 50 p-values, their Benjamini-Hochberg adjusted p-values, and their q-values.

p-values:
0.0000 0.0000 0.0002 0.0004 0.0004 0.0010 0.0089 0.0089 0.0288 0.0479
0.0755 0.0755 0.0755 0.1136 0.1631 0.2244 0.2964 0.3768 0.3768 0.3768
0.4623 0.4623 0.4623 0.5491 0.5491 0.6331 0.7107 0.7107 0.7107 0.7107
0.7107 0.8371 0.9201 0.9470 0.9470 0.9660 0.9660 0.9660 0.9790 0.9928
0.9928 0.9928 0.9928 0.9960 0.9960 0.9989 0.9989 0.9995 0.9999 1.0000

BH adjusted p-values:
0.0000 0.0000 0.0033 0.0040 0.0040 0.0083 0.0556 0.0556 0.1600 0.2395
0.2904 0.2904 0.2904 0.4057 0.5437 0.7012 0.8718 0.9420 0.9420 0.9420
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

q-values:
0.0000 0.0000 0.0023 0.0033 0.0033 0.0068 0.0454 0.0454 0.1267 0.1861
0.2509 0.2509 0.2509 0.3351 0.4198 0.4989 0.5681 0.6257 0.6257 0.6257
0.6723 0.6723 0.6723 0.7090 0.7090 0.7375 0.7592 0.7592 0.7592 0.7592
0.7592 0.7879 0.8032 0.8078 0.8078 0.8108 0.8108 0.8108 0.8129 0.8150
0.8150 0.8150 0.8150 0.8155 0.8155 0.8159 0.8159 0.8160 0.8161 0.8161

a. How many p-values are considered significant using BH adjusted p-values and thresholds of 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30?

b. How many p-values are considered significant using q-values and thresholds of 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30?

c. Compare the two sets of results.
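Parts (a) and (b) amount to counting how many of the listed values fall at or below each threshold; a minimal helper in Python, assuming the 50 BH adjusted p-values and q-values above have been typed into arrays:

```python
import numpy as np

def count_significant(values, thresholds=(0.05, 0.10, 0.15, 0.20, 0.25, 0.30)):
    """Number of hypotheses declared significant at each threshold."""
    values = np.asarray(values)
    return {t: int((values <= t).sum()) for t in thresholds}

# e.g., count_significant(bh_adjusted_values) vs. count_significant(q_values),
# where the two arrays hold the 50 values listed above; comparing the two
# resulting dictionaries answers part (c).
```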

10. An alternative to the definitions of the false discovery rate discussed so far is the local false discovery rate, which is based on modeling the observed values of the test statistic as a mixture of two distributions, where most of the observations come from the null distribution and some observations (the interesting ones) come from the alternative distribution. (See Section 8.2.2 for more information on mixture models.) If Z is our test statistic, the density, f(z), of Z is given by the following:

f(z) = p0 f0(z) + p1 f1(z),   (10.13)

where p0 is the probability an instance comes from the null distribution, f0(z) is the distribution of the p-values under the null hypothesis, p1 is the probability an instance comes from the alternative distribution, and f1(z) is the distribution of p-values under the alternative hypothesis.

Using Bayes' theorem, we can derive the probability of the null hypothesis for any value of z as follows:

p(H0 | z) = f(H0 and z) / f(z) = p0 f0(z) / f(z).   (10.14)

The quantity p(H0 | z) is the quantity that we would like to define as the local fdr. Since p0 is often close to 1, the local false discovery rate, represented as fdr, all lowercase, is defined as the following:

fdr(z) = f0(z) / f(z).   (10.15)

This is a point estimate, not an interval estimate as with the standard FDR, which is based on p-values, and as such, it will vary with the value of the test statistic. Note that the local fdr has an easy interpretation, namely as the ratio of the density of observations from the null distribution to observations from both the null and alternative distributions. It also has the advantage of being interpretable directly as a real probability.

The challenge, of course, is in estimating the densities involved in Equation 10.15, which are usually estimated empirically. We consider the following simple case, where we specify the distributions by Gaussian distributions. The null distribution is given by f0(z) = N(0, 1), while the alternative distribution is given by f1(z) = N(3, 1). p0 = 0.999 and p1 = 0.001.

a. Compute p(H0 | z) for the following values of z: 2, 2.5, 3, 3.5, 4, 4.5, 5.

b. Compute the local fdr for the following values of z: 2, 2.5, 3, 3.5, 4, 4.5, 5.

c. How close are these two sets of values?
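Parts (a) and (b) can be checked directly from Equations 10.14 and 10.15 with the stated Gaussian densities; a sketch in Python:

```python
from scipy.stats import norm

p0, p1 = 0.999, 0.001   # mixture weights given in the exercise

def posterior_h0(z):
    """p(H0 | z) = p0 * f0(z) / f(z), Equation 10.14, with f0 = N(0,1), f1 = N(3,1)."""
    f0, f1 = norm.pdf(z, 0, 1), norm.pdf(z, 3, 1)
    return p0 * f0 / (p0 * f0 + p1 * f1)

def local_fdr(z):
    """fdr(z) = f0(z) / f(z), Equation 10.15 (the p0 factor in the numerator is dropped)."""
    f0, f1 = norm.pdf(z, 0, 1), norm.pdf(z, 3, 1)
    return f0 / (p0 * f0 + p1 * f1)

for z in (2, 2.5, 3, 3.5, 4, 4.5, 5):
    print(z, round(posterior_h0(z), 4), round(local_fdr(z), 4))
```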

11. The following are two alternatives to swap randomization, presented in Section 10.4.2, for randomizing a binary matrix so that the number of 1's in any row and column is preserved. Examine each method and (i) verify that it does indeed preserve the number of 1's in any row and column, and (ii) identify the problem with the alternative approach.

a. Randomly permute the order of the columns and rows. An example is shown in Figure 10.12.

Figure 10.12. A 3 × 3 matrix before and after randomizing the order of the rows and columns. The leftmost matrix is the original.

b. Figure 10.13 shows another approach to randomizing a binary matrix. This approach converts the binary matrix to a row-column representation, then randomly reassigns the columns to various entries, and finally converts the data back into the original binary matrix format.

Figure 10.13. A 4 × 4 matrix before and after randomizing the entries. From right to left, the tables represent the following: the original binary data matrix, the matrix in row-column format, the row-column format after randomly permuting the entries in the col column, and the matrix reconstructed from the randomized row-column representation.
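For part (a), permuting the rows and columns is a one-liner with NumPy, and the check that row and column counts of 1's are preserved is equally short; this is a sketch, and the example matrix is ours.

```python
import numpy as np

def permute_rows_and_columns(matrix, seed=0):
    """Randomly permute the order of the rows and columns of a binary matrix."""
    rng = np.random.default_rng(seed)
    m = matrix[rng.permutation(matrix.shape[0]), :]
    return m[:, rng.permutation(matrix.shape[1])]

m = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
r = permute_rows_and_columns(m)
print(sorted(m.sum(axis=1)), sorted(r.sum(axis=1)))  # row counts of 1's, as multisets
print(sorted(m.sum(axis=0)), sorted(r.sum(axis=0)))  # column counts of 1's, as multisets
```

Each original row and column keeps its count of 1's; the permutation only relabels which row or column it is, which is a hint toward part (ii) of the exercise.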

Author Index

Abdulghani,A.,433Abe,N.,345Abello,J.,698Abiteboul,S.,437Abraham,B.,745Adams,N.M.,430Adolfi,F.G.,807Adomavicius,G.,18Aelst,S.V.,749Afshar,R.,509Agarwal,R.C.,342,429,747Aggarwal,C.,18,430,507,509Aggarwal,C.C.,339,429,430,600,602,745,808Agrawal,R.,18,104,183,184,430,431,435,436,507,509,696Aha,D.W.,181,339Akaike,H.,181Akoglu,L.,745Aksehirli,E.,434Albers,C.,807Alcalá-Fdez,J.,508Aldenderfer,M.S.,600Alexandridis,M.G.,184Ali,K.,430Allen,D.M.,181Allison,D.B.,181Allwein,E.L.,340Alsabti,K.,181Altman,R.B.,19Alvarez,J.L.,508Amatriain,X.,18Ambroise,C.,181

Anderberg,M.R.,102,600Anderson,T.W.,849Andrews,R.,340Ankerst,M.,600Antonie,M.-L.,507Anvari,F.,807Aone,C.,601Apps,M.A.,807Arabie,P.,600,602Argamon,S.E.,807Arnold,A.,746Arthur,D.,600Atluri,G.,19,508Atwal,G.S.,103Aumann,Y.,507Austin,J.,747Ayres,J.,507

Baguley,T.,807Bai,H.,696Baker,M.,806Baker,R.J.,343Bakiri,G.,341Bakos,J.,438Baldi,P.,340Ball,G.,600Bandyopadhyay,S.,103Banerjee,A.,18,19,600,696,746Barbará,D.,430,696Barnett,V.,745Bass,D.,806Bastide,Y.,435Basu,S.,696Batistakis,Y.,601Baxter,R.A.,747Bay,S.D.,430,746Bayardo,R.,430Becker,R.,807Beckman,R.J.,746Belanche,L.,104Belkin,M.,849Ben-David,S.,19Bengio,S.,341Bengio,Y.,340,341,343,345Benjamin,D.J.,806Benjamini,Y.,430,806Bennett,K.,340Benning,S.D.,807Berger,J.,806

Berk,R.,806Berkhin,P.,600Bernecker,T.,430Berrar,D.,340Berry,M.J.A.,18Bertino,E.,20Bertolami,R.,342Bhaskar,R.,430Bienenstock,E.,182Bilmes,J.,696Bins,M.,183Bishop,C.M.,181,340,696Blashfield,R.K.,600Blei,D.M.,696Blondel,M.,20,184Blum,A.,102,340Bobadilla,J.,18Bock,H.-H.,600Bock,H.H.,102Boiteau,V.,600Boley,D.,600,602Bollen,K.,806Bolton,R.J.,430Borg,I.,102Borgwardt,K.M.,434Boswell,R.,340Bosworth,A.,103Bottou,L.,340Boulicaut,J.-F.,507Bowyer,K.W.,340Bradley,A.P.,340

Bradley,P.S.,18,438,600,696Bradski,G.,182Bratko,I.,183Breiman,L.,181,340Brembs,B.,806Breslow,L.A.,181Breunig,M.M.,600,746Brin,S.,430,431,436Brockwell,A.,20Brodley,C.E.,184,747Brown,L.,806Brucher,M.,20,184Bunke,H.,342Buntine,W.,181Burges,C.J.C.,340Burget,L.,343Burke,R.,18Buturovic,L.J.,183Bykowski,A.,507

Cai,C.H.,431Cai,J.,438Camerer,C.,806Campbell,C.,340Canny,J.F.,344Cantú-Paz,E.,181Cao,H.,431Cardie,C.,699Carreira-Perpinan,M.A.,849Carvalho,C.,435Catanzaro,B.,182Ceri,S.,434Cernock`y,J.,343Chakrabarti,S.,18Chan,P.K.,19,341Chan,R.,431Chandola,V.,746Chang,E.Y.,696Chang,L.,749Charrad,M.,600Chatterjee,S.,698,747Chaudhary,A.,746Chaudhuri,S.,103Chawla,N.V.,340,746Chawla,S.,746Cheeseman,P.,696Chen,B.,436Chen,C.,746Chen,M.-S.,18,435,509Chen,Q.,431,508,749Chen,S.,434

Chen,S.-C.,749Chen,W.Y.,696Chen,Z.,749Cheng,C.H.,431Cheng,H.,432,507Cheng,R.,437Cherkassky,V.,18,182,340Chervonenkis,A.Y.,184Cheung,D.,749Cheung,D.C.,431Cheung,D.W.,431,437Chiu,B.,747Chiu,J.,433Choudhary,A.,434,698Chrisman,N.R.,102Chu,C.,182Chu,G.,808Chuang,A.,745Chui,C.K.,431Chung,S.M.,433,508Clark,P.,340Clifton,C.,18,437Clifton,D.A.,748Clifton,L.,748Coatney,M.,435,508Cochran,W.G.,102Codd,E.F.,102Codd,S.B.,102Cohen,P.R.,183,807Cohen,W.W.,340Collingridge,D.S.,806

Contreras,P.,600,602Cook,D.J.,698Cook,R.D.,746Cooley,R.,431Cost,S.,340Cotter,A.,182Cournapeau,D.,20,184Courville,A.,340,341Courville,A.C.,341Couto,J.,430Cover,T.M.,102,341Cristianini,N.,104,341Cui,X.,181

Dabney,A.,806Dash,M.,103Datta,S.,19Davidson,I.,696Davies,L.,746Davis,R.W.,808Dayal,U.,431,508,602Dean,J.,890Demmel,J.W.,102,832,849Deng,A.,807Desrosiers,C.,18Dhillon,I.S.,600Diaz-Verdejo,J.,746Diday,E.,102Diederich,J.,340Dietterich,T.G.,341,343Ding,C.,437,696Ding,C.H.Q.,698Dokas,P.,431Domingos,P.,18,182,341Dong,G.,431Donoho,D.L.,849Doster,W.,184Dougherty,J.,102Doursat,R.,182Drummond,C.,341Dubes,R.C.,103,601,807Dubourg,V.,20,184Duchesnay,E.,20,184Duda,R.O.,18,182,341,600Duda,W.,344

Dudoit,S.,182Duin,R.P.W.,182,344DuMouchel,W.,431Dunagan,J.,746Dunham,M.H.,18,341Dunkel,B.,431

Edgington,E.,806Edwards,D.D.,344Efron,B.,182,806Elkan,C.,341,601Ellis,P.D.,806Elomaa,T.,103Erhan,D.,341Erhart,M.,698EricksonIII,D.J.,103Ernst,M.D.,806Ertöz,L.,431,696,697,748Eskin,E.,746,748Esposito,F.,182Ester,M.,600–602Everitt,B.S.,601Evfimievski,A.V.,431Ezeife,C.,508

Fürnkranz,J.,341Fabris,C.C.,431Faghmous,J.,19Faghmous,J.H.,18Faloutsos,C.,20,104,748,849Fan,J.,697Fan,W.,341Fang,G.,432,436Fang,Y.,699Fawcett,T.,344Fayyad,U.M.,18,103,438,600,696Feng,L.,432,437Feng,S.,697Fernández,S.,342Ferri,C.,341Field,B.,432Finucane,H.K.,104Fisher,D.,601,697Fisher,N.I.,432Fisher,R.A.,182,806Flach,P.,340Flach,P.A.,341Flannick,J.,507Flynn,P.J.,601Fodor,I.K.,849Fournier-Viger,P.,507Fovino,I.N.,20Fox,A.J.,746Fraley,C.,697Frank,E.,19,20,182,345Frasca,B.,807

Frawley,W.,435Freeman,D.,696Freitas,A.A.,431,432Freund,Y.,341Friedman,J.,182,342Friedman,J.H.,19,181,432,601Fu,A.,431,433Fu,A.W.-c.,749Fu,Y.,432,508Fukuda,T.,432,434,507Fukunaga,K.,182,341Furuichi,E.,435

Gada,D.,103Ganguly,A.,19Ganguly,A.R.,18,103Ganti,V.,182,697Gao,X.,601GaohuaGu,F.H.,103Garcia-Teodoro,P.,746Garofalakis,M.N.,507Garriga,G.C.,807Gather,U.,746Geatz,M.,20Gehrke,J.,18,19,104,182,431,507,696,697Geiger,D.,341Geisser,S.,182Gelman,A.,806Geman,S.,182Gersema,E.,183Gersho,A.,697Ghazzali,N.,600Ghemawat,S.,890Ghosh,A.,746Ghosh,J.,600,697,699Giannella,C.,19Gibbons,P.B.,748Gibson,D.,697Gigerenzer,G.,806Gionis,A.,746,806Glymour,C.,19Gnanadesikan,R.,746Goethals,B.,434Goil,S.,698

Goldberg,A.B.,345Golub,G.H.,832Gomariz,A.,507Good,P.,806Goodfellow,I.,341Goodfellow,I.J.,341Goodman,R.M.,344Goodman,S.,806Gorfine,M.,103Gowda,K.C.,697Grama,A.,20Gramfort,A.,20,184Graves,A.,342Gray,J.,103Gray,R.M.,697Greenland,S.,806Gries,D.,890Grimes,C.,849Grinstein,G.G.,18Grisel,O.,20,184Groenen,P.,102Grossman,R.L.,19,698Grossman,S.R.,104Guan,Y.,600Guestrin,C.,20Guha,S.,19,697Gunopulos,D.,432,437,696Guntzer,U.,432Gupta,M.,432Gupta,R.,432,508Gutiérrez,A.,18

Hagen,L.,697Haibt,L.,344Haight,R.,436Haiminen,N.,807Halic,M.,183Halkidi,M.,601Hall,D.,600Hall,L.O.,340Hall,M.,19,182Hamerly,G.,601Hamilton,H.J.,432Han,E.,432Han,E.-H.,183,342,432,508,601,697,698Han,J.,18,19,342,430,432–435,437,507–509,601,698Hand,D.J.,19,103,342,430Hardin,J.,746Hart,P.E.,18,182,341,600Hartigan,J.,601Hastie,T.,19,182,342,601Hatonen,K.,437Hawkins,D.M.,747Hawkins,S.,747He,H.,747He,Q.,433,699He,X.,437,696He,Y.,437,697Hearst,M.,342Heath,D.,182Heckerman,D.,342Heller,R.,103Heller,Y.,103

Hernando,A.,18Hernández-Orallo,J.,341Herrera,F.,508Hey,T.,19Hidber,C.,432Hilderman,R.J.,432Hinneburg,A.,697Hinton,G.,343Hinton,G.E.,342–344Hipp,J.,432Ho,C.-T.,20Hochberg,Y.,430,806Hodge,V.J.,747Hofmann,H.,432Holbrook,S.R.,437Holder,L.B.,698Holland,J.,344Holmes,G.,19,182Holt,J.D.,433Holte,R.C.,341,342Hong,J.,343Hornick,M.F.,19Houtsma,M.,433Hsieh,M.J.,509Hsu,M.,431,508,602Hsu,W.,434Hsueh,S.,433Huang,H.-K.,748Huang,T.S.,698Huang,Y.,433Hubbard,R.,807

Hubert,L.,600,602Hubert,M.,749Hulten,G.,18Hung,E.,431Hussain,F.,103Hwang,S.,433Hämäläinen,W.,807Höppner,F.,697

Iba,W.,343Imielinski,T.,430,433Inokuchi,A.,433,508Ioannidis,J.P.,807Ioffe,S.,342Irani,K.B.,103

Jagadish,H.V.,747Jager,L.R.,807Jain,A.K.,19,103,182,601,807Jajodia,S.,430Janardan,R.,849Japkowicz,N.,340,342,746,747Jardine,N.,601Jaroszewicz,S.,433Jarvis,R.A.,697Jensen,D.,104,183,807Jensen,F.V.,342Jeudy,B.,507Johannesson,M.,806John,G.H.,103Johnson,T.,747Jolliffe,I.T.,103,849Jonyer,I.,698Jordan,M.I.,342,696Joshi,A.,19Joshi,M.V.,183,342,343,508,747

Kahng,A.,697Kailing,K.,698Kallio,A.,807Kalpakis,K.,103Kamath,C.,19,181,698Kamber,M.,19,342,433,601Kantarcioglu,M.,18Kantardzic,M.,19Kao,B.,431Karafiát,M.,343Kargupta,H.,19Karpatne,A.,19Karypis,G.,18,183,342,432,433,436,508,509,601,602,697,698Kasif,S.,182,183Kass,G.V.,183Kass,R.E.,807Kaufman,L.,103,601Kawale,J.,747Kegelmeyer,P.,19,698Kegelmeyer,W.P.,340Keim,D.A.,697Kelly,J.,696Keogh,E.,747Keogh,E.J.,103Keshet,J.,182Kettenring,J.R.,746Keutzer,K.,182Khan,S.,103Khan,S.S.,747Khardon,R.,432Khoshgoftaar,T.M.,20

Khudanpur,S.,343Kifer,D.,19Kim,B.,183Kim,S.K.,182Kinney,J.B.,103Kitagawa,H.,748Kitsuregawa,M.,436Kivinen,J.,343Klawonn,F.,697Kleinberg,J.,19Kleinberg,J.M.,601,697Klemettinen,M.,433,437Klooster,S.,435,436,698Knorr,E.M.,747Kogan,J.,600Kohavi,R.,102,103,183,807Kohonen,T.,698Kolcz,A.,340,746Kong,E.B.,343Koperski,K.,432Kosters,W.A.,433Koudas,N.,747Koutra,D.,745Kröger,P.,698Kramer,S.,509Krantz,D.,103–105Kriegel,H.,430Kriegel,H.-P.,600–602,698,746,748,749Krishna,G.,697Krizhevsky,A.,342–344Krstajic,D.,183

Kruse,R.,697Kruskal,J.B.,103,849Kröger,P.,601Kubat,M.,343Kuhara,S.,435Kulkarni,S.R.,183Kumar,A.,747Kumar,V.,18,19,183,342–344,431,432,435–437,508,509,601,602,696–698,746–748,849Kuok,C.M.,433Kuramochi,M.,433,508Kwok,I.,747Kwong,W.W.,431

Lagani,V.,184Lajoie,I.,345Lakens,D.,807Lakhal,L.,435Lakshmanan,L.V.S.,434Lambert,D.,19Landau,S.,601Lander,E.S.,104Landeweerd,G.,183Landgrebe,D.,183,184Lane,T.,747Langford,J.C.,345,849Langley,P.,102,343,697Larochelle,H.,345Larsen,B.,601Lavrac,N.,343Lavrač,N.,434Law,M.H.C.,19Laxman,S.,430Layman,A.,103Lazar,N.A.,808Lazarevic,A.,431,748Leahy,D.E.,183LeCun,Y.,343Lee,D.D.,698Lee,P.,433Lee,S.D.,431,437Lee,W.,433,748Lee,Y.W.,105Leek,J.T.,807,808Leese,M.,601

Lent,B.,508Leroy,A.M.,748Lewis,D.D.,343Lewis,T.,745Li,F.,699Li,J.,431Li,K.-L.,748Li,N.,433Li,Q.,849Li,T.,698Li,W.,105,433,438Li,Y.,430Li,Z.,601,602,808Liao,W.-K.,434Liess,S.,747Lim,E.,433Lin,C.J.,696Lin,K.-I.,434,849Lin,M.,433Lin,Y.-A.,182Lindell,Y.,507Lindgren,B.W.,103Lindquist,E.F.,807Ling,C.X.,343Linoff,G.,18Lipton,Z.C.,20Liu,B.,434,437,509Liu,H.,103,104Liu,J.,434Liu,L.-M.,746Liu,R.Y.,748

Liu,Y.,433,434,601Livny,M.,699Liwicki,M.,342Llinares-López,F.,434Lonardi,S.,747Lu,C.-T.,749Lu,H.J.,432,435,437Lu,Y.,438Luce,R.D.,103–105Ludwig,J.,19Lugosi,G.,183Luo,C.,508Luo,W.,697

Ma,D.,697Ma,H.,699Ma,L.,699Ma,Y.,434Mabroukeh,N.R.,508Maciá-Fernández,G.,746MacQueen,J.,601Madden,M.G.,747Madigan,D.,19Malerba,D.,182Maletic,J.I.,434Malik,J.,698Malik,J.M.,344Mamoulis,N.,431Manganaris,S.,430Mangasarian,O.,343Mannila,H.,19,342,432,437,508,806,807Manzagol,P.-A.,341,345Mao,H.,697Mao,J.,182Maranell,G.M.,104Marchiori,E.,433Marcus,A.,434Margineantu,D.D.,343Markou,M.,748Martin,D.,508Masand,B.,431Mata,J.,508Matsuzawa,H.,434Matwin,S.,343McCullagh,P.,343

McCulloch,W.S.,343McLachlan,G.J.,181McVean,G.,104Megiddo,N.,434Mehta,M.,183,184Meilǎ,M.,602MeiraJr.,W.,20Meo,R.,434Merugu,S.,600Meyer,G.,19Meyerson,A.,19,697Michalski,R.S.,183,343,698,699Michel,V.,20,184Michie,D.,183,184Mielikäinen,T.,806Mikolov,T.,343Miller,H.J.,601Miller,R.J.,434,508Milligan,G.W.,602Mingers,J.,183Mirkin,B.,602Mirza,M.,341Mishra,N.,19,697,698Misra,J.,890Mitchell,T.,20,183,340,343,602,698Mitzenmacher,M.,104Mobasher,B.,432,697Modha,D.S.,600Moens,S.,434Mok,K.W.,433,748Molina,L.C.,104

Montgomery,D.C.,807Mooney,R.,696Moore,A.W.,602,746Moret,B.M.E.,183Morimoto,Y.,432,434,507Morishita,S.,432,507Mortazavi-Asl,B.,435,508Mosteller,F.,104,434Motoda,H.,103,104,433,508Motwani,R.,19,430,431,436,437,697Mozetic,I.,343Mueller,A.,434Muggleton,S.,343Muirhead,C.R.,748Mulier,F.,18,340Mulier,F.M.,182Mullainathan,S.,19Murphy,K.P.,183Murphy,K.R.,807Murtagh,F.,600,602,698Murthy,S.K.,183Murty,M.N.,601Muthukrishnan,S.,747Myers,C.L.,508Myneni,R.,435Myors,B.,807Müller,K.-R.,849

Nagesh,H.,698Nakhaeizadeh,G.,432Namburu,R.,19,698Naughton,J.F.,438Navathe,S.,435,509Nebot,A.,104Nelder,J.A.,343Nelson,L.D.,808Nemani,R.,435Nestorov,S.,437Neyman,J.,807Ng,A.Y.,182,696Ng,R.T.,434,698,746,747Niblett,T.,183,340Nielsen,M.A.,343Niknafs,A.,600Nishio,S.,435Niyogi,P.,849Nobel,A.B.,434Norvig,P.,344Nosek,B.A.,806Novak,P.K.,434Nuzzo,R.,807

O’Callaghan,L.,19,697Oates,T.,104Oerlemans,A.,433Ogihara,M.,105,438Ohsuga,S.,438Ojala,M.,807Olken,F.,104Olshen,R.,181Olukotun,K.,182Omiecinski,E.,435,509Onghena,P.,806Ono,T.,435Orihara,M.,438Ortega,F.,18Osborne,J.,104Ostrouchov,G.,103others,19,184,602,748,806,807Ozden,B.,435Ozgur,A.,435,748

Padmanabhan,B.,438,509Page,G.P.,181Palit,I.,183Palmer,C.R.,104Pan,S.J.,343Pandey,G.,432,508Pang,A.,434Papadimitriou,S.,20,748Papaxanthos,L.,434Pardalos,P.M.,698Parelius,J.M.,748Park,H.,849Park,J.S.,435ParrRud,O.,20Parthasarathy,S.,105,435,438,508Pasquier,N.,435Passos,A.,20,184Patrick,E.A.,697Pattipati,K.R.,184Paulsen,S.,434Pazzani,M.,341,430Pazzani,M.J.,103Pearl,J.,341,344Pearson,E.S.,807Pedregosa,F.,20,184Pei,J.,19,432,433,435,508Pelleg,D.,602Pellow,F.,103Peng,R.D.,807Perrot,M.,20,184Pesarin,F.,807

Peters,M.,698Pfahringer,B.,19,182Piatetsky-Shapiro,G.,18,20,435Pimentel,M.A.,748Pirahesh,H.,103Pison,G.,749Pitts,W.,343Platt,J.C.,748Pohlmann,N.,807Portnoy,L.,746,748Potter,C.,435,436,698Powers,D.M.,344Prasad,V.V.V.,429Pregibon,D.,19,20,431Prerau,M.,746Prettenhofer,P.,20,184Prince,M.,344Prins,J.,434Protopopescu,V.,103Provenza,L.P.,20Provost,F.J.,104,344Psaila,G.,434Pujol,J.M.,18Puttagunta,V.,103Pyle,D.,20

Quinlan,J.R.,184,344

Raftery,A.E.,697,807Raghavan,P.,696,697Rakhshani,A.,184Ramakrishnan,N.,20Ramakrishnan,R.,18,104,182,697,699Ramaswamy,S.,435,748Ramkumar,G.D.,435Ramoni,M.,344Ranka,S.,181,435Rao,N.,508Rastogi,R.,507,697,748Reddy,C.K.,183,600,602,808Redman,T.C.,104Rehmsmeier,M.,344Reichart,D.,103Reina,C.,696Reisende,M.G.C.,698Renz,M.,430Reshef,D.,104Reshef,D.N.,104Reshef,Y.,104Reshef,Y.A.,104Reutemann,P.,19,182Ribeiro,M.T.,20Richter,L.,509Riondato,M.,435Riquelme,J.C.,508Rissanen,J.,183Rivest,R.L.,184Robinson,D.,806Rochester,N.,344

Rocke,D.M.,746,748Rogers,S.,699Roiger,R.,20Romesburg,C.,602Ron,D.,698Ronkainen,P.,437Rosenblatt,F.,344Rosenthal,A.,437Rosete,A.,508Rosner,B.,748Rotem,D.,104Rousseeuw,P.J.,103,601,748Rousu,J.,103Roweis,S.T.,849Ruckert,U.,509Runkler,T.,697Russell,S.J.,344Ruts,I.,748

Sabeti,P.,104Sabeti,P.C.,104Sabripour,M.,181Safavian,S.R.,184Sahami,M.,102Saigal,S.,103Saito,T.,344Salakhutdinov,R.,344Salakhutdinov,R.R.,342Salmaso,L.,807Salzberg,S.,182,183,340Samatova,N.,18,19Sander,J.,600–602,746Sarawagi,S.,435Sarinnapakorn,K.,749Satou,K.,435Saul,L.K.,849Savaresi,S.M.,602Savasere,A.,435,509Saygin,Y.,20Schölkopf,B.,344Schafer,J.,20Schaffer,C.,184Schapire,R.E.,340,341Scheuermann,P.,436Schikuta,E.,698Schmidhuber,J.,342,344Schroeder,M.R.,698Schroedl,S.,699Schubert,E.,748,749Schuermann,J.,184

Schwabacher,M.,746Schwartzbard,A.,746Schwarz,G.,184Schölkopf,B.,104,748,849Scott,D.W.,749Sebastiani,P.,344Self,M.,696Semeraro,G.,182Sendhoff,B.,696Seno,M.,436,509Settles,B.,344Seung,H.S.,698Shafer,J.C.,184,429Shasha,D.E.,20Shawe-Taylor,J.,104,341,748Sheikholeslami,G.,698Shekhar,S.,18,19,433,435,437,749Shen,W.,509Shen,Y.,431Sheng,V.S.,343Shi,J.,698Shi,Z.,433Shibayama,G.,435Shim,K.,507,697,748Shinghal,R.,433Shintani,T.,436Shu,C.,699Shyu,M.-L.,749Sibson,R.,601Siebes,A.P.J.M.,432Siegmund,D.,808

Silberschatz,A.,435,436Silva,V.d.,849Silver,N.,808Silverstein,C.,430,436Simmons,J.P.,808Simon,H.,696Simon,N.,104Simon,R.,184Simonsohn,U.,808Simovici,D.,433Simpson,E.-H.,436Singer,Y.,340Singh,K.,748Singh,L.,436Singh,S.,20,748Singh,V.,181Sivakumar,K.,19Smalley,C.T.,102Smith,A.D.,430Smola,A.J.,104,343,344,748,849Smyth,P.,18–20,342,344Sneath,P.H.A.,104,602Soete,G.D.,600,602Sokal,R.R.,104,602Song,Y.,696Soparkar,N.,431Speed,T.,104Spiegelhalter,D.J.,183Spiliopoulou,M.,431Späth,H.,602Srebro,N.,182

Srikant,R.,18,104,430,431,434,436,507,509Srivastava,J.,431,436,509,748Srivastava,N.,342,344Steinbach,M.,18,19,183,344,432,435–437,508,602,696–698,747Stepp,R.E.,698,699Stevens,S.S.,104Stolfo,S.J.,341,433,746,748Stone,C.J.,181Stone,M.,184Storey,J.D.,806,808Stork,D.G.,18,182,341,600Strang,G.,832Strehl,A.,699Strimmer,K.,808Struyf,A.,749Stutz,J.,696Su,X.,20Suen,C.Y.,185Sugiyama,M.,434Sun,S.,344Sun,T.,699Sun,Z.,430Sundaram,N.,182Suppes,P.,103–105Sutskever,I.,342–344Suzuki,E.,436Svensen,M.,696Swami,A.,430,433,508Swaminathan,R.,698Sykacek,P.,749Szalay,A.S.,746

Szegedy,C.,342

Takagi,T.,435Tan,C.L.,103Tan,H.,697Tan,P.-N.,344,698Tan,P.N.,183,431,435–437,509Tang,J.,749Tang,S.,435Tansley,S.,19Tao,D.,345Taouil,R.,435Tarassenko,L.,748Tatti,N.,437Tax,D.M.J.,344Tay,S.H.,437,509Taylor,C.C.,183Taylor,J.E.,808Taylor,W.,696Tenenbaum,J.B.,849Teng,W.G.,509Thakurta,A.,430Theodoridis,Y.,20Thirion,B.,20,184Thomas,J.A.,102Thomas,S.,183,435Thompson,K.,343Tian,S.-F.,748Tibshirani,R.,19,104,182,184,342,344,601,806,808Tibshirani,R.J.,184Tickle,A.,340Timmers,T.,183Toivonen,H.,20,105,432,437,508

Tokuyama,T.,432,434,507Tolle,K.M.,19Tompkins,R.G.,808Tong,H.,745Torregrosa,A.,436Tsamardinos,I.,184Tsaparas,P.,806Tseng,V.S.,507Tsoukatos,I.,437Tsur,S.,431,435,437Tucakov,V.,747Tukey,J.W.,104,105,748Tung,A.,437,601Turnbaugh,P.J.,104Tusher,V.,806Tusher,V.G.,808Tuzhilin,A.,18,436,438,509Tversky,A.,103–105Tzvetkov,P.,509

Ullman,J.,431,437Uslaner,E.M.,103Utgoff,P.E.,184Uthurusamy,R.,18

Vaidya,J.,18,437Valiant,L.,184vanAssen,M.A.,807vanRijsbergen,C.J.,344vanZomeren,B.C.,748Vanderplas,J.,20,184Vandin,F.,435vanderLaan,M.J.,182VanLoan,C.F.,832Vapnik,V.,345Vapnik,V.N.,184Varma,S.,184Varoquaux,G.,20,184Vassilvitskii,S.,600Vazirgiannis,M.,601Velleman,P.F.,105Vempala,S.,746Venkatesh,S.S.,183Venkatrao,M.,103Verhein,F.,430Verkamo,A.I.,508Verma,T.S.,341Verykios,V.S.,20Vincent,P.,340,341,345Virmani,A.,433Vitter,J.S.,890vonLuxburg,U.,699vonSeelen,W.,696vonderMalsburg,C.,696Vorbruggen,J.C.,696Vreeken,J.,808

Vu,Q.,436Vuokko,N.,807Vázquez,E.,746

Wagenmakers,E.-J.,806Wagstaff,K.,696,699Wainwright,M.,342Walker,T.,807Wang,H.,184Wang,J.,430,509Wang,J.T.L.,20Wang,K.,437,509Wang,L.,437Wang,Q.,19Wang,Q.R.,185Wang,R.Y.,105Wang,W.,432,434Wang,Y.R.,105Warde-Farley,D.,341Washio,T.,433,508Wasserstein,R.L.,808Webb,A.R.,20,345Webb,G.I.,434,437,509,808Weiss,G.M.,345Weiss,R.,20,184Welch,W.J.,808Werbos,P.,345Widmer,G.,341Widom,J.,508Wierse,A.,18Wilhelm,A.F.X.,432Wilkinson,L.,105Williams,C.K.I.,696Williams,G.J.,747Williamson,R.C.,343,748

Wimmer,M.,600Wish,M.,849Witten,I.H.,19,20,182,345Wojdanowski,R.,748Wolach,A.,807Wong,M.H.,433Woodruff,D.L.,748Wu,C.-W.,507Wu,J.,601Wu,N.,430Wu,S.,601Wu,X.,20,344,509Wunsch,D.,602

Xiang,D.,748Xiao,W.,808Xin,D.,432Xiong,H.,433,436,437,601,602,808,849Xu,C.,345Xu,R.,602Xu,W.,748Xu,X.,600–602Xu,Y.,807

Yamamura,Y.,435Yan,X.,19,432,437,507,509Yang,C.,438Yang,Q.,343,431Yang,Y.,185,434,508Yao,Y.Y.,438Ye,J.,849Ye,N.,104,697,749Yesha,Y.,19Yin,Y.,432Yiu,T.,507Yoda,K.,434Yu,H.,436,699Yu,J.X.,432Yu,L.,104Yu,P.S.,18–20,430,435,745Yu,Y.,182

Zaïane,O.R.,432,507Zadrozny,B.,345Zahn,C.T.,602Zaki,M.J.,20,105,438,509,698Zaniolo,C.,184Zeng,C.,438Zeng,L.,433Zhang,A.,698Zhang,B.,438,602Zhang,C.,509Zhang,F.,438Zhang,H.,438,509Zhang,J.,341Zhang,M.-L.,345Zhang,N.,19Zhang,P.,749Zhang,S.,509Zhang,T.,699Zhang,Y.,185,437,438Zhang,Z.,438Zhao,W.,699Zhao,Y.,602Zhong,N.,438Zhou,Z.-H.,345Zhu,H.,435Zhu,X.,345Ziad,M.,105Zimek,A.,601,748,749Züfle,A.,430

Subject Index

k-nearestneighborgraph,657,663,664

accuracy,119,196activationfunction,251AdaBoost,306aggregation,51–52anomalydetection

applications,703–704clustering-based,724–728

example,726impactofoutliers,726membershipinacluster,725numberofclusters,728strengthsandweaknesses,728

definition,705–706definitions

distance-based,719density-based,720–724deviationdetection,703exceptionmining,703outliers,703proximity-based

distance-based,seeanomalydetection,distance-basedrelativedensity,722–723

example,723statistical,710–719

Gaussian,710Grubbs,751likelihoodapproach,715multivariate,712strengthsandweaknesses,718

techniques,708–709Apriori

algorithm,364principle,363

associationanalysis,357categoricalattributes,451continuousattributes,454indirect,503pattern,358rule,seerule

attribute,26–33definitionof,27numberofvalues,32type,27–32

asymmetric,32–33binary,32continuous,30,32discrete,32generalcomments,33–34interval,29,30nominal,29,30ordinal,29,30qualitative,30quantitative,30ratio,29

avoidingfalsediscoveries,755–806considerationsforanomalydetection,800–803considerationsforassociationanalysis,787randomization,793–795considerationsforclassification,783–787considerationsforclusteranalysis,795–800generatinganulldistribution,776–783

permutationtest,781randomization,781

hypothesistesting,seehypothesistestingmultiplehypothesistesting,seeFalseDiscoveryRateproblemswithsignificanceandhypothesistesting,778

axon,249

backpropagation,258bagging,seeclassifierBayes

naive,seeclassifiernetwork,seeclassifiertheorem,214

biasvariancedecomposition,300binarization,seediscretization,binarization,452,455BIRCH,684–686BonferroniProcedure),768boosting,seeclassifierBregmandivergence,94–95

candidategeneration,367,368,471,487itemset,362pruning,368,472,493rule,381sequence,468

case,seeobjectchameleon,660–666

algorithm,664–665graphpartitioning,664,665mergingstrategy,662relativecloseness,663relativeinterconnectivity,663self-similarity,656,661,663–665strengthsandlimitations,666

characteristic,seeattributecityblockdistance,seedistance,cityblockclass

imbalance,313classification

classlabel,114evaluation,119

classifierbagging,302base,296Bayesianbelief,227boosting,305combination,296decisiontree,119ensemble,296logisticregression,243

maximalmargin,278naive-Bayes,218nearestneighbor,208neuralnetworks,249perceptron,250probabilistic,212randomforest,310Rote,208rule-based,195supportvectormachine,276unstable,300

climateindices,680clusteranalysis

algorithmcharacteristics,619–620mappingtoanotherdomain,620nondeterminism,619optimization,620orderdependence,619parameterselection,seeparameterselectionscalability,seescalability

applications,525–527asanoptimizationproblem,620chameleon,seechameleonchoosinganalgorithm,690–693clustercharacteristics,617–618

datadistribution,618density,618poorlyseparated,618relationships,618shape,618size,618

subspace,618clusterdensity,618clustershape,548,618clustersize,618datacharacteristics,615–617

attributetypes,617datatypes,617high-dimensionality,616mathematicalproperties,617noise,616outliers,616scale,617size,616sparseness,616

DBSCAN,seeDBSCANdefinitionof,525,528DENCLUE,seeDENCLUEdensity-basedclustering,644–656fuzzyclustering,seefuzzyclusteringgraph-basedclustering,656–681

sparsification,657–658grid-basedclustering,seegrid-basedclusteringhierarchical,seehierarchicalclustering

CURE,seeCURE,seeCUREminimumspanningtree,658–659

Jarvis-Patrick,seeJarvis-PatrickK-means,seeK-meansmixturemodes,seemixturemodelsopossum,seeopossumparameterselection,567,587,619prototype-basedclustering,621–644

seesharednearestneighbor,density-basedclustering,679self-organizingmaps,seeself-organizingmapsspectralclustering,666subspaceclustering,seesubspaceclusteringsubspaceclusters,618typesofclusterings,529–531

complete,531exclusive,530fuzzy,530hierarchical,529overlapping,530partial,531partitional,529

typesofclusters,531–533conceptual,533density-based,532graph-based,532prototype-based,531well-separated,531

validation,seeclustervalidationclustervalidation,571–597

assessmentofmeasures,594–596clusteringtendency,571,588cohesion,574–579copheneticcorrelation,586forindividualclusters,581forindividualobjects,581hierarchical,585,594numberofclusters,587relativemeasures,574separation,574–578

silhouettecoefficient,581–582supervised,589–594

classificationmeasures,590–592similaritymeasures,592–594

supervisedmeasures,573unsupervised,574–589unsupervisedmeasures,573withproximitymatrix,582–585

codeword,332compactionfactor,400concepthierarchy,462conditionalindependence,229confidence

factor,196level,857measure,seemeasure

confusionmatrix,118constraint

maxgap,475maxspan,474mingap,475timing,473windowsize,476

contingencytable,402correlation

ϕ-coefficient,406coverage,196criticalregion,seehypothesistesting,criticalregioncross-validation,165CURE,686–690

algorithm,686,688

clusterfeature,684clusteringfeaturetree,684useofpartitioning,689–690useofsampling,688–689

curseofdimensionality,292

dag,seegraphdata

attribute,seeattributeattributetypes,617cleaning,seedataquality,datacleaningdistribution,618high-dimensional,616

problemswithsimilarity,673marketbasket,357mathematicalproperties,617noise,616object,seeobjectoutliers,616preprocessing,seepreprocessingquality,seedataqualityscale,617set,seedatasetsimilarity,seesimilaritysize,616sparse,616transformations,seetransformationstypes,617typesof,23,26–42

dataquality,23,42–50applicationissues,49–50

datadocumentation,50relevance,49timliness,49

datacleaning,42errors,43–48

accuracy,45

artifacts,44bias,45collection,43duplicatedata,48inconsistentvalues,47–48measurment,43missingvalues,46–47noise,43–44outliers,46precision,45significantdigits,45

dataset,26characteristics,34–35

dimensionality,34resolution,35sparsity,34

typesof,34–42graph-based,37–38matrix,seematrixordered,38–41record,35–37sequence,40sequential,38spatial,41temporal,38timeseries,39transaction,36

DBSCAN,565–569algorithm,567comparisontoK-means,614–615complexity,567

definitionofdensity,565parameterselection,567typesofpoints,566

border,566core,566noise,566

decisionboundary,146list,198stump,303tree,seeclassifier

deduplication,48DENCLUE,652–656

algorithm,653implementationissues,654kerneldensityestimation,654strengthsandlimitations,654

dendrite,249density

centerbased,565dimension,seeattributedimensionality

curse,57dimensionalityreduction,56–58,833–848

factoranalysis,840–842FastMap,845ISOMAP,845–847issues,847–848LocallyLinearEmbedding,842–844multidimensionalscaling,844–845PCA,58

SVD,58discretization,63–69,221

association,seeassociationbinarization,64–65clustering,456equalfrequency,456equalwidth,456ofbinaryattributes,seediscretization,binarizationofcategoricalvariables,68–69ofcontinuousattributes,65–68

supervised,66–68unsupervised,65–66

dissimilarity,76–78,94–95choosing,98–100definitionof,72distance,seedistancenon-metric,77transformations,72–75

distance,76–77cityblock,76Euclidean,76,822Hamming,332L1norm,76L2norm,76L∞,76Lmax,76Mahalanobis,96Manhattan,76metric,77

positivity,77symmetry,77

triangleinequality,77Minkowski,76–77supremum,76

distributionbinomial,162Gaussian,162,221

eagerlearner,seelearneredge,480effectsize,seehypothesistesting,effectsizeelement,466EMalgorithm,631–637ensemblemethod,seeclassifierentity,seeobjectentropy,67,128

useindiscretization,seediscretization,supervisederror

error-correctingoutputcoding,331generalization,156pessimistic,158

errorrate,119estimateerror,164Euclideandistance,seedistance,Euclideanevaluation

association,401exhaustive,198

factoranalysis,seedimensionalityreduction,factoranalysisFalseDiscoveryRate,778

Benjamini-HochbergFDR,772family-wiseerrorrate,768FastMap,seedimensionalityreduction,FastMapfeature

irrelevant,144featurecreation,61–63

featureextraction,61–62mappingdatatoanewspace,62–63

featureextraction,seefeaturecreation,featureextractionfeatureselection,58–61

architecturefor,59–60featureweighting,61irrelevantfeatures,58redundantfeatures,58typesof,58–59

embedded,58filter,59wrapper,59

field,seeattributeFouriertransform,62FP-growth,393FP-tree,seetreefrequentsubgraph,479fuzzyclustering,621–626

fuzzyc-means,623–626algorithm,623centroids,624example,626initialization,624

SSE,624strenthsandlimitations,626weightupdate,625

fuzzysets,622fuzzypsuedo-partition,623

gainratio,135generalization,seeruleginiindex,128graph,480

connected,484directedacyclic,462Laplacian,667undirected,484

grid-basedclustering,644–648algorithm,645clusters,646density,645gridcells,645

hierarchicalclustering,554–565agglomerativealgorithm,555centroidmethods,562clusterproximity,555

Lance-Williamsformula,562completelink,555,558–559complexity,556groupaverage,555,559–560inversions,562MAX,seecompletelinkmergingdecisions,564MIN,seesinglelinksinglelink,555,558Ward’smethod,561

high-dimensionalityseedata,high-dimensional,616

holdout,165hypothesis

alternative,459,858null,459,858

hypothesistesting,761criticalregion,763effectsize,766power,764TypeIerror,763TypeIIerror,764

independenceconditional,218

informationgainentropy-based,131FOIL’s,201

interest,seemeasureISOMAP,seedimensionalityreduction,ISOMAPisomorphism

definition,481item,seeattribute,358

competing,494negative,494

itemset,359candidate,seecandidateclosed,386maximal,384

Jarvis-Patrick,676–678algorithm,676example,677strengthsandlimitations,677

K-means,534–553algorithm,535–536bisecting,547–548centroids,537,539

choosinginitial,539–544comparisontodBSCAN,614–615complexity,544derivation,549–553emptyclusters,544incremental,546K-means++,543–544limitations,548–549objectivefunctions,537,539outliers,545reducingSEE,545–546

kerneldensityestimation,654kernelfunction,90–94

L1norm,seedistance,L1normL2norm,seedistance,L2normLagrangian,280lazylearner,seelearnerlearner

eager,208,211lazy,208,211

leastsquares,831leave-one-out,167lexicographicorder,371linearalgebra,817–832

matrix,seematrixvector,seevector

linearregression,831linearsystemsofequations,831lineartransformation,seematrix,lineartransformationLocallyLinearEmbedding,seedimensionalityreduction,LocallyLinearEmbedding

m-estimate,224majorityvoting,seevotingManhattandistance,seedistance,Manhattanmargin

soft,284marketbasketdata,seedatamatrix,37,823–829

addition,824–825columnvector,824confusion,seeconfusionmatrixdefinition,823–824document-term,37eigenvalue,829eigenvaluedecomposition,829–830eigenvector,829indataanalysis,831–832inverse,828–829linearcombination,835lineartransformations,827–829

columnspace,828leftnullspace,828nullspace,828projection,827reflection,827rotation,827rowspace,828scaling,827

multiplication,825–827positivesemidefinite,835rank,828rowvector,824

scalarmultiplication,825singularvalue,830singularvaluedecomposition,830singularvector,830sparse,37

maxgap,seeconstraintmaximumlikelihoodestimation,629–631maxspan,seeconstraintMDL,160mean,222measure

confidence,360consistency,408interest,405IS,406objective,401properties,409support,360symmetric,414

measurement,27–32definitionof,27scale,27

permissibletransformations,30–31types,27–32

metricaccuracy,119

metricsclassification,119

min-Apriori,461mingap,seeconstraintminimumdescriptionlength,seeMDL

missingvalues,seedataquality,errors,missingvaluesmixturemodels,627–637

advantagesandlimitations,637definitionof,627–629EMalgorithm,631–637maximumlikelihoodestimation,629–631

modelcomparison,173descriptive,116generalization,118overfitting,147predictive,116selection,156

modelcomplexityOccam’sRazor

AIC,157BIC,157

monotonicity,364multiclass,330multidimensionalscaling,seedimensionalityreduction,multidimensionalscalingmultiplecomparison,seeFalseDiscoveryRatemultiplehypothesistesting,seeFalseDiscoveryRate

family-wiseerrorrate,seefamily-wiseerrorratemutualexclusive,198mutualinformation,88–89

nearestneighborclassifier,seeclassifiernetwork

Bayesian,seeclassifiermultilayer,seeclassifierneural,seeclassifier

neuron,249node

internal,120leaf,120non-terminal,120root,120terminal,120

noise,211nulldistribution,758nullhypothesis,757

object,26observation,seeobjectOccam’srazor,157OLAP,51opposum,659–660

algorithm,660strengthsandweaknesses,660

outliers,seedataqualityoverfitting,seemodel,149

p-value,759pattern

cross-support,420hyperclique,423infrequent,493negative,494negativelycorrelated,495,496sequential,seesequentialsubgraph,seesubgraph

PCA,833–836examples,836mathematics,834–835

perceptron,seeclassifierpermutationtest,781Piatesky-Shapiro

PS,405point,seeobjectpower,seehypothesistesting,powerPrecision-RecallCurve,328precondition,195preprocessing,23,50–71

aggregation,seeaggregationbinarization,seediscretization,binarizationdimensionalityreduction,56discretization,seediscretizationfeaturecreation,seefeaturecreationfeatureselection,seefeatureselectionsampling,seesamplingtransformations,seetransformations

proximity,71–100choosing,98–100

cluster,555definitionof,71dissimilarity,seedissimilaritydistance,seedistanceforsimpleattributes,74–75issues,96–98

attributeweights,98combiningproximities,97–98correlation,96standardization,96

similarity,seesimilaritytransformations,72–74

pruningpost-pruning,163prepruning,162

randomforestseeclassifier,310

randomization,781associationpatterns,793–795

ReceiverOperatingCharacteristiccurve,seeROCrecord,seeobjectreducederrorpruning,189,346regression

logistic,243ROC,323Roteclassifier,seeclassifierrule

antecedent,195association,360candidate,seecandidateconsequent,195generalization,458generation,205,362,380,458ordered,198ordering,206pruning,202quantitative,454

discretization-based,454non-discretization,460statistics-based,458

redundant,458specialization,458validation,459

ruleset,195

sample,seeobjectsampling,52–56,314

approaches,53–54progressive,55–56random,53stratified,54withreplacement,54withoutreplacement,54

samplesize,54–55scalability

clusteringalgorithms,681–690BIRCH,684–686CURE,686–690generalissues,681–684

segmentation,529self-organizingmaps,637–644

algorithm,638–641applications,643strengthsandlimitations,643

sensitivity,319sequence

datasequence,468definition,466

sequentialpattern,464patterndiscovery,468timingconstraints,seeconstraint

sequentialcovering,199sharednearestneighbor,656

density,678–679density-basedclustering,679–681

algorithm,680example,680strengthsandlimitations,681

principle,657similarity,673–676

computation,675differencesindensity,674versusdirectsimilarity,676

significancelevel,859

significancetesting,761nulldistribution,seenulldistributionnullhypothesis,seenullhypothesisp-value,seep-valuestatisticalsignificance,seestatisticalsignificance

similarity,24,78–85choosing,98–100correlation,83–85cosine,81–82,822definitionof,72differences,85–88extendedJaccard,83Jaccard,80–81kernelfunction,90–94mutualinformation,88–89sharednearestneighbor,seesharednearestneighbor,similaritysimplematchingcoefficient,80Tanimoto,83transformations,72–75

Simpson’sparadox,416softsplitting,178

SOM,618,seeself-organizingmapsspecialization,seerulesplitinformation,135statisticalsignificance,760statistics

covarinacematrix,834subgraph

core,487definition,482pattern,479support,seesupport

subsequence,467contiguous,475

subspaceclustering,648–652CLIQUE,650

algorithm,651monotonicityproperty,651strengthsandlimitations,652

example,648subtree

replacement,163support

count,359counting,373,473,477,493limitation,402measure,seemeasurepruning,364sequence,468subgraph,483

supportvector,276supportvectormachine,seeclassifier

SVD,838–840example,838–840mathematics,838

SVM,seeclassifiernonlinear,290

svmnon-separable,284

synapse,249

taxonomy,seeconcepthierarchytransaction,358

extended,463width,379

transformations,69–71betweensimilarityanddissimilarity,72–74normalization,70–71simplefunctions,70standardization,70–71

treeconditionalFP-tree,398decision,seeclassifierFP-tree,394hash,375oblique,146

triangleinequality,77truepositive,319TypeIerror,seehypothesistesting,TypeIerrorTypeIIerror,seehypothesistesting,TypeIIerror

underfitting,149universalapproximator,261

variable,seeattributevariance,222vector,817–823

addition,817–818column,seematrix,columnvectordefinition,817dotproduct,820–822indataanalysis,822–823linearindependence,821–822mean,823mulitplicationbyascalar,818–819norm,820orthogonal,819–821orthogonalprojection,821row,seematrix,rowvectorspace,819–820basis,819dimension,819independentcomponents,819linearcombination,819span,819

vectorquantization,527vertex,480voting

distance-weighted,210majority,210

wavelettransform,63webcrawler,138windowsize,seeconstraint

Copyright Permissions

Some figures and part of the text of Chapter 8 originally appeared in the article "Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data," Levent Ertöz, Michael Steinbach, and Vipin Kumar, Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, May 1–3, 2003, SIAM. © 2003, SIAM.

Some figures and part of the text of Chapter 5 appeared in the article "Selecting the Right Objective Measure for Association Analysis," Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava, Information Systems, 29(4), 293–313, 2004, Elsevier. © 2004, Elsevier.

Some of the figures and text of Chapter 8 appeared in the article "Discovery of Climate Indices Using Clustering," Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Steven Klooster, and Christopher Potter, KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 446–455, Washington, DC, August 2003, ACM. © 2003, ACM, INC. DOI = http://doi.acm.org/10.1145/956750.956801

Some of the figures (1–7, 13) and text of Chapter 7 originally appeared in the chapter "The Challenge of Clustering High-Dimensional Data," Levent Ertoz, Michael Steinbach, and Vipin Kumar, in New Directions in Statistical Physics, Econophysics, Bioinformatics, and Pattern Recognition, 273–312, Editor, Luc Wille, Springer, ISBN 3-540-43182-9. © 2004, Springer-Verlag.

Some of the figures and text of Chapter 8 originally appeared in the article "Chameleon: Hierarchical Clustering Using Dynamic Modeling," by George Karypis, Eui-Hong (Sam) Han, and Vipin Kumar, IEEE Computer, Volume 32(8), 68–75, August 1999, IEEE. © 1999, IEEE.

List of Tables

1. Table 1.1. Market basket data.
2. Table 1.2. Collection of news articles.
3. Table 2.1. A sample data set containing student information.
4. Table 2.2. Different attribute types.
5. Table 2.3. Transformations that define attribute levels.
6. Table 2.4. Data set containing information about customer purchases.
7. Table 2.5. Conversion of a categorical attribute to three binary attributes.
8. Table 2.6. Conversion of a categorical attribute to five asymmetric binary attributes.
9. Table 2.7. Similarity and dissimilarity for simple attributes
10. Table 2.8. x and y coordinates of four points.
11. Table 2.9. Euclidean distance matrix for Table 2.8.
12. Table 2.10. L1 distance matrix for Table 2.8.
13. Table 2.11. L∞ distance matrix for Table 2.8.
14. Table 2.12. Properties of cosine, correlation, and Minkowski distance measures.
15. Table 2.13. Similarity between (x, y), (x, ys), and (x, yt).
16. Table 2.14. Entropy for x
17. Table 2.15. Entropy for y
18. Table 2.16. Joint entropy for x and y
19. Table 3.1. Examples of classification tasks.
20. Table 3.2. A sample data for the vertebrate classification problem.
21. Table 3.3. A sample data for the loan borrower classification problem.
22. Table 3.4. Confusion matrix for a binary classification problem.
23. Table 3.5. Data set for Exercise 2.
24. Table 3.6. Data set for Exercise 3.
25. Table 3.7. Comparing the test accuracy of decision trees T10 and T100.
26. Table 3.8. Comparing the accuracy of various classification methods.
27. Table 4.1. Example of a rule set for the vertebrate classification problem.
28. Table 4.2. The vertebrate data set.
29. Table 4.3. Example of a mutually exclusive and exhaustive rule set.
30. Table 4.4. Example of data set used to construct an ensemble of bagging classifiers.
31. Table 4.5. Comparing the accuracy of a decision tree classifier against three ensemble methods.
32. Table 4.6. A confusion matrix for a binary classification problem in which the classes are not equally important.
33. Table 4.7. Entries of the confusion matrix in terms of the TPR, TNR, skew, α, and total number of instances, N.
34. Table 4.8. Comparison of various rule-based classifiers.
35. Table 4.9. Data set for Exercise 7.
36. Table 4.10. Data set for Exercise 8.
37. Table 4.11. Data set for Exercise 10.
38. Table 4.12. Data set for Exercise 12.
39. Table 4.13. Posterior probabilities for Exercise 16.
40. Table 5.1. An example of market basket transactions.
41. Table 5.2. A binary 0/1 representation of market basket data.
42. Table 5.3. List of binary attributes from the 1984 United States Congressional Voting Records. Source: The UCI machine learning repository.
43. Table 5.4. Association rules extracted from the 1984 United States Congressional Voting Records.
44. Table 5.5. A transaction data set for mining closed itemsets.
45. Table 5.6. A 2-way contingency table for variables A and B.
46. Table 5.7. Beverage preferences among a group of 1000 people.
47. Table 5.8. Information about people who drink tea and people who use honey in their beverage.
48. Table 5.9. Examples of objective measures for the itemset {A, B}.
49. Table 5.10. Example of contingency tables.
50. Table 5.11. Rankings of contingency tables using the measures given in Table 5.9.
51. Table 5.12. Contingency tables for the pairs {p, q} and {r, s}.
52. Table 5.13. The grade-gender example. (a) Sample data of size 100.
53. Table 5.14. An example demonstrating the effect of null addition.
54. Table 5.15. Properties of symmetric measures.
55. Table 5.16. Example of a three-dimensional contingency table.
56. Table 5.17. A two-way contingency table between the sale of high-definition television and exercise machine.
57. Table 5.18. Example of a three-way contingency table.
58. Table 5.19. Grouping the items in the census data set based on their support values.
59. Table 5.20. Example of market basket transactions.
60. Table 5.21. Market basket transactions.
61. Table 5.22. Example of market basket transactions.
62. Table 5.23. Example of market basket transactions.
63. Table 5.24. A Contingency Table.
64. Table 5.25. Contingency tables for Exercise 20.
65. Table 6.1. Internet survey data with categorical attributes.
66. Table 6.2. Internet survey data after binarizing categorical and symmetric binary attributes.
67. Table 6.3. Internet survey data with continuous attributes.
68. Table 6.4. Internet survey data after binarizing categorical and continuous attributes.
69. Table 6.5. A breakdown of Internet users who participated in online chat according to their age group.
70. Table 6.6. Document-word matrix.
71. Table 6.7. Examples illustrating the concept of a subsequence.
72. Table 6.8. Graph representation of entities in various application domains.
73. Table 6.9. A two-way contingency table for the association rule X→Y.
74. Table 6.10. Traffic accident data set.
75. Table 6.11. Data set for Exercise 2.
76. Table 6.12. Data set for Exercise 3.
77. Table 6.13. Data set for Exercise 4.
78. Table 6.14. Data set for Exercise 6.
79. Table 6.15. Example of market basket transactions.
80. Table 6.16. Example of event sequences generated by various sensors.
81. Table 6.17. Example of event sequence data for Exercise 14.
82. Table 6.18. Example of numeric data set.
83. Table 7.1. Table of notation.
84. Table 7.2. K-means: Common choices for proximity, centroids, and objective functions.
85. Table 7.3. xy-coordinates of six points.
86. Table 7.4. Euclidean distance matrix for six points.
87. Table 7.5. Table of Lance-Williams coefficients for common hierarchical clustering approaches.
88. Table 7.6. Table of graph-based cluster evaluation measures.
89. Table 7.7. Cophenetic distance matrix for single link and data in Table 2.14 on page 90.
90. Table 7.8. Cophenetic correlation coefficient for data of Table 2.14 and four agglomerative hierarchical clustering techniques.
91. Table 7.9. K-means clustering results for the LA Times document data set.
92. Table 7.10. Ideal cluster similarity matrix.
93. Table 7.11. Class similarity matrix.
94. Table 7.12. Two-way contingency table for determining whether pairs of objects are in the same class and same cluster.
95. Table 7.13. Similarity matrix for Exercise 16.
96. Table 7.14. Confusion matrix for Exercise 21.
97. Table 7.15. Table of cluster labels for Exercise 24.
98. Table 7.16. Similarity matrix for Exercise 24.
99. Table 8.1. First few iterations of the EM algorithm for the simple example.
100. Table 8.2. Point counts for grid cells.
101. Table 8.3. Similarity among documents in different sections of a newspaper.
102. Table 8.4. Two nearest neighbors of four points.
103. Table 9.1. Sample pairs (c, α), α = prob(|x| ≥ c) for a Gaussian distribution with mean 0 and standard deviation 1.
104. Table 9.2. Survey data of weight and height of 100 participants.
105. Table 10.1. Confusion table in the context of multiple hypothesis testing.
106. Table 10.2. Correspondence between statistical testing concepts and classifier evaluation measures
107. Table 10.3. A 2-way contingency table for variables A and B.
108. Table 10.4. Beverage preferences among a group of 1000 people.
109. Table 10.5. Beverage preferences among a group of 1000 people.
110. Table 10.6. Contingency table for an anomaly detection system with detection rate d and false alarm rate f.
111. Table 10.7. Beverage preferences among a group of 100 people (left) and 10,000 people (right).
112. Table 10.8. Ordered collection of p-values.

Contents

  • INTRODUCTION TO DATA MINING
  • INTRODUCTION TO DATA MINING
  • Preface to the Second Edition
    • Overview
    • What is New in the Second Edition?
    • To the Instructor
    • Support Materials
  • Contents
  • 1 Introduction
    • 1.1 What Is Data Mining?
    • 1.2 Motivating Challenges
    • 1.3 The Origins of Data Mining
    • 1.4 Data Mining Tasks
    • 1.5 Scope and Organization of the Book
    • 1.6 Bibliographic Notes
    • Bibliography
    • 1.7 Exercises
  • 2 Data
    • 2.1 Types of Data
      • 2.1.1 Attributes and Measurement
        • What Is an Attribute?
        • The Type of an Attribute
        • The Different Types of Attributes
        • Describing Attributes by the Number of Values
        • Asymmetric Attributes
        • General Comments on Levels of Measurement
      • 2.1.2 Types of Data Sets
        • General Characteristics of Data Sets
          • Dimensionality
          • Distribution
          • Resolution
        • Record Data
          • Transaction or Market Basket Data
          • The Data Matrix
          • The Sparse Data Matrix
        • Graph-Based Data
          • Data with Relationships among Objects
          • Data with Objects That Are Graphs
        • Ordered Data
          • Sequential Transaction Data
          • Time Series Data
          • Sequence Data
          • Spatial and Spatio-Temporal Data
        • Handling Non-Record Data
    • 2.2 Data Quality
      • 2.2.1 Measurement and Data Collection Issues
        • Measurement and Data Collection Errors
        • Noise and Artifacts
        • Precision, Bias, and Accuracy
        • Outliers
        • Missing Values
          • Eliminate Data Objects or Attributes
          • Estimate Missing Values
          • Ignore the Missing Value during Analysis
        • Inconsistent Values
        • Duplicate Data
      • 2.2.2 Issues Related to Applications
    • 2.3 Data Preprocessing
      • 2.3.1 Aggregation
      • 2.3.2 Sampling
        • Sampling Approaches
        • Progressive Sampling
      • 2.3.3 Dimensionality Reduction
        • The Curse of Dimensionality
        • Linear Algebra Techniques for Dimensionality Reduction
      • 2.3.4 Feature Subset Selection
        • An Architecture for Feature Subset Selection
        • Feature Weighting
      • 2.3.5 Feature Creation
        • Feature Extraction
        • Mapping the Data to a New Space
      • 2.3.6 Discretization and Binarization
        • Binarization
        • Discretization of Continuous Attributes
          • Unsupervised Discretization
          • Supervised Discretization
        • Categorical Attributes with Too Many Values
      • 2.3.7 Variable Transformation
        • Simple Functions
        • Normalization or Standardization
    • 2.4 Measures of Similarity and Dissimilarity
      • 2.4.1 Basics
        • Definitions
        • Transformations
      • 2.4.2 Similarity and Dissimilarity between Simple Attributes
      • 2.4.3 Dissimilarities between Data Objects
        • Distances
      • 2.4.4 Similarities between Data Objects
      • 2.4.5 Examples of Proximity Measures
        • Similarity Measures for Binary Data
          • Simple Matching Coefficient
          • Jaccard Coefficient
        • Cosine Similarity
        • Extended Jaccard Coefficient (Tanimoto Coefficient)
        • Correlation
        • Differences Among Measures For Continuous Attributes
      • 2.4.6 Mutual Information
      • 2.4.7 Kernel Functions*
      • 2.4.8 Bregman Divergence*
      • 2.4.9 Issues in Proximity Calculation
        • Standardization and Correlation for Distance Measures
        • Combining Similarities for Heterogeneous Attributes
        • Using Weights
      • 2.4.10 Selecting the Right Proximity Measure
    • 2.5 Bibliographic Notes
    • Bibliography
    • 2.6 Exercises
  • 3 Classification: Basic Concepts and Techniques
    • 3.1 Basic Concepts
    • 3.2 General Framework for Classification
    • 3.3 Decision Tree Classifier
      • 3.3.1 A Basic Algorithm to Build a Decision Tree
        • Hunt's Algorithm
        • Design Issues of Decision Tree Induction
      • 3.3.2 Methods for Expressing Attribute Test Conditions
      • 3.3.3 Measures for Selecting an Attribute Test Condition
        • Impurity Measure for a Single Node
        • Collective Impurity of Child Nodes
        • Identifying the best attribute test condition
        • Splitting of Qualitative Attributes
        • Binary Splitting of Qualitative Attributes
        • Binary Splitting of Quantitative Attributes
        • Gain Ratio
      • 3.3.4 Algorithm for Decision Tree Induction
      • 3.3.5 Example Application: Web Robot Detection
      • 3.3.6 Characteristics of Decision Tree Classifiers
    • 3.4 Model Overfitting
      • 3.4.1 Reasons for Model Overfitting
        • Limited Training Size
        • High Model Complexity
    • 3.5 Model Selection
      • 3.5.1 Using a Validation Set
      • 3.5.2 Incorporating Model Complexity
        • Estimating the Complexity of Decision Trees
        • Minimum Description Length Principle
      • 3.5.3 Estimating Statistical Bounds
      • 3.5.4 Model Selection for Decision Trees
    • 3.6 Model Evaluation
      • 3.6.1 Holdout Method
      • 3.6.2 Cross-Validation
    • 3.7 Presence of Hyper-parameters
      • 3.7.1 Hyper-parameter Selection
      • 3.7.2 Nested Cross-Validation
    • 3.8 Pitfalls of Model Selection and Evaluation
      • 3.8.1 Overlap between Training and Test Sets
      • 3.8.2 Use of Validation Error as Generalization Error
    • 3.9 Model Comparison*
      • 3.9.1 Estimating the Confidence Interval for Accuracy
      • 3.9.2 Comparing the Performance of Two Models
    • 3.10 Bibliographic Notes
    • Bibliography
    • 3.11 Exercises
  • 4 Classification: Alternative Techniques
    • 4.1 Types of Classifiers
    • 4.2 Rule-Based Classifier
      • 4.2.1 How a Rule-Based Classifier Works
      • 4.2.2 Properties of a Rule Set
      • 4.2.3 Direct Methods for Rule Extraction
        • Learn-One-Rule Function
          • Rule Pruning
          • Building the Rule Set
        • Instance Elimination
      • 4.2.4 Indirect Methods for Rule Extraction
      • 4.2.5 Characteristics of Rule-Based Classifiers
    • 4.3 Nearest Neighbor Classifiers
      • 4.3.1 Algorithm
      • 4.3.2 Characteristics of Nearest Neighbor Classifiers
    • 4.4 Naïve Bayes Classifier
      • 4.4.1 Basics of Probability Theory
        • Bayes Theorem
        • Using Bayes Theorem for Classification
      • 4.4.2 Naïve Bayes Assumption
        • Conditional Independence
        • How a Naïve Bayes Classifier Works
        • Estimating Conditional Probabilities for Categorical Attributes
        • Estimating Conditional Probabilities for Continuous Attributes
        • Handling Zero Conditional Probabilities
        • Characteristics of Naïve Bayes Classifiers
    • 4.5 Bayesian Networks
      • 4.5.1 Graphical Representation
        • Conditional Independence
        • Joint Probability
        • Use of Hidden Variables
      • 4.5.2 Inference and Learning
        • Variable Elimination
        • Sum-Product Algorithm for Trees
        • Generalizations for Non-Tree Graphs
        • Learning Model Parameters
      • 4.5.3 Characteristics of Bayesian Networks
    • 4.6 Logistic Regression
      • 4.6.1 Logistic Regression as a Generalized Linear Model
      • 4.6.2 Learning Model Parameters
      • 4.6.3 Characteristics of Logistic Regression
    • 4.7 Artificial Neural Network (ANN)
      • 4.7.1 Perceptron
        • Learning the Perceptron
      • 4.7.2 Multi-layer Neural Network
        • Learning Model Parameters
      • 4.7.3 Characteristics of ANN
    • 4.8 Deep Learning
      • 4.8.1 Using Synergistic Loss Functions
        • Saturation of Outputs
        • Cross entropy loss function
      • 4.8.2 Using Responsive Activation Functions
        • Vanishing Gradient Problem
        • Rectified Linear Units (ReLU)
      • 4.8.3 Regularization
        • Dropout
      • 4.8.4 Initialization of Model Parameters
        • Supervised Pretraining
        • Unsupervised Pretraining
        • Use of Autoencoders
        • Hybrid Pretraining
      • 4.8.5 Characteristics of Deep Learning
    • 4.9 Support Vector Machine (SVM)
      • 4.9.1 Margin of a Separating Hyperplane
        • Rationale for Maximum Margin
      • 4.9.2 Linear SVM
        • Learning Model Parameters
      • 4.9.3 Soft-margin SVM
        • SVM as a Regularizer of Hinge Loss
      • 4.9.4 Nonlinear SVM
        • Attribute Transformation
        • Learning a Nonlinear SVM Model
      • 4.9.5 Characteristics of SVM
    • 4.10 Ensemble Methods
      • 4.10.1 Rationale for Ensemble Method
      • 4.10.2 Methods for Constructing an Ensemble Classifier
      • 4.10.3 Bias-Variance Decomposition
      • 4.10.4 Bagging
      • 4.10.5 Boosting
        • AdaBoost
      • 4.10.6 Random Forests
      • 4.10.7 Empirical Comparison among Ensemble Methods
    • 4.11 Class Imbalance Problem
      • 4.11.1 Building Classifiers with Class Imbalance
        • Oversampling and Undersampling
        • Assigning Scores to Test Instances
      • 4.11.2 Evaluating Performance with Class Imbalance
      • 4.11.3 Finding an Optimal Score Threshold
      • 4.11.4 Aggregate Evaluation of Performance
        • ROC Curve
        • Precision-Recall Curve
    • 4.12 Multiclass Problem
    • 4.13 Bibliographic Notes
    • Bibliography
    • 4.14 Exercises
  • 5 Association Analysis: Basic Concepts and Algorithms
    • 5.1 Preliminaries
    • 5.2 Frequent Itemset Generation
      • 5.2.1 The Apriori Principle
      • 5.2.2 Frequent Itemset Generation in the Apriori Algorithm
      • 5.2.3 Candidate Generation and Pruning
        • Candidate Generation
          • Brute-Force Method
          • Fk−1×F1 Method
          • Fk−1×Fk−1 Method
        • Candidate Pruning
      • 5.2.4 Support Counting
        • Support Counting Using a Hash Tree*
      • 5.2.5 Computational Complexity
    • 5.3 Rule Generation
      • 5.3.1 Confidence-Based Pruning
      • 5.3.2 Rule Generation in Apriori Algorithm
      • 5.3.3 An Example: Congressional Voting Records
    • 5.4 Compact Representation of Frequent Itemsets
      • 5.4.1 Maximal Frequent Itemsets
      • 5.4.2 Closed Itemsets
    • 5.5 Alternative Methods for Generating Frequent Itemsets*
    • 5.6 FP-Growth Algorithm*
      • 5.6.1 FP-Tree Representation
      • 5.6.2 Frequent Itemset Generation in FP-Growth Algorithm
    • 5.7 Evaluation of Association Patterns
      • 5.7.1 Objective Measures of Interestingness
        • Alternative Objective Interestingness Measures
        • Consistency among Objective Measures
        • Properties of Objective Measures
          • Inversion Property
          • Scaling Property
          • Null Addition Property
        • Asymmetric Interestingness Measures
      • 5.7.2 Measures beyond Pairs of Binary Variables
      • 5.7.3 Simpson's Paradox
    • 5.8 Effect of Skewed Support Distribution
    • 5.9 Bibliographic Notes
    • Bibliography
    • 5.10 Exercises
  • 6 Association Analysis: Advanced Concepts
    • 6.1 Handling Categorical Attributes
    • 6.2 Handling Continuous Attributes
      • 6.2.1 Discretization-Based Methods
      • 6.2.2 Statistics-Based Methods
        • Rule Generation
        • Rule Validation
      • 6.2.3 Non-discretization Methods
    • 6.3 Handling a Concept Hierarchy
    • 6.4 Sequential Patterns
      • 6.4.1 Preliminaries
        • Sequences
        • Subsequences
      • 6.4.2 Sequential Pattern Discovery
      • 6.4.3 Timing Constraints*
        • The maxspan Constraint
        • The mingap and maxgap Constraints
        • The Window Size Constraint
      • 6.4.4 Alternative Counting Schemes*
    • 6.5 Subgraph Patterns
      • 6.5.1 Preliminaries
        • Graphs
        • Graph Isomorphism
        • Subgraphs
      • 6.5.2 Frequent Subgraph Mining
      • 6.5.3 Candidate Generation
      • 6.5.4 Candidate Pruning
      • 6.5.5 Support Counting
    • 6.6 Infrequent Patterns*
      • 6.6.1 Negative Patterns
      • 6.6.2 Negatively Correlated Patterns
      • 6.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
      • 6.6.4 Techniques for Mining Interesting Infrequent Patterns
      • 6.6.5 Techniques Based on Mining Negative Patterns
      • 6.6.6 Techniques Based on Support Expectation
        • Support Expectation Based on Concept Hierarchy
        • Support Expectation Based on Indirect Association
    • 6.7 Bibliographic Notes
    • Bibliography
    • 6.8 Exercises
  • 7 Cluster Analysis: Basic Concepts and Algorithms
    • 7.1 Overview
      • 7.1.1 What Is Cluster Analysis?
      • 7.1.2 Different Types of Clusterings
      • 7.1.3 Different Types of Clusters
        • Road Map
    • 7.2 K-means
      • 7.2.1 The Basic K-means Algorithm
        • Assigning Points to the Closest Centroid
        • Centroids and Objective Functions
          • Data in Euclidean Space
          • Document Data
          • The General Case
        • Choosing Initial Centroids
          • K-means++
        • Time and Space Complexity
      • 7.2.2 K-means: Additional Issues
        • Handling Empty Clusters
        • Outliers
        • Reducing the SSE with Postprocessing
        • Updating Centroids Incrementally
      • 7.2.3 Bisecting K-means
      • 7.2.4 K-means and Different Types of Clusters
      • 7.2.5 Strengths and Weaknesses
      • 7.2.6 K-means as an Optimization Problem
        • Derivation of K-means as an Algorithm to Minimize the SSE
        • Derivation of K-means for SAE
    • 7.3 Agglomerative Hierarchical Clustering
      • 7.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
        • Defining Proximity between Clusters
        • Time and Space Complexity
      • 7.3.2 Specific Techniques
        • Sample Data
        • Single Link or MIN
        • Complete Link or MAX or CLIQUE
        • Group Average
        • Ward’s Method and Centroid Methods
      • 7.3.3 The Lance-Williams Formula for Cluster Proximity
      • 7.3.4 Key Issues in Hierarchical Clustering
        • Lack of a Global Objective Function
        • Ability to Handle Different Cluster Sizes
        • Merging Decisions Are Final
      • 7.3.5 Outliers
      • 7.3.6 Strengths and Weaknesses
    • 7.4 DBSCAN
      • 7.4.1 Traditional Density: Center-Based Approach
        • Classification of Points According to Center-Based Density
      • 7.4.2 The DBSCAN Algorithm
        • Time and Space Complexity
        • Selection of DBSCAN Parameters
        • Clusters of Varying Density
        • An Example
      • 7.4.3 Strengths and Weaknesses
    • 7.5 Cluster Evaluation
      • 7.5.1 Overview
      • 7.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
        • Graph-Based View of Cohesion and Separation
        • Prototype-Based View of Cohesion and Separation
        • Relationship between Prototype-Based Cohesion and Graph-Based Cohesion
        • Relationship of the Two Approaches to Prototype-Based Separation
        • Relationship between Cohesion and Separation
        • Relationship between Graph- and Centroid-Based Cohesion
        • Overall Measures of Cohesion and Separation
        • Evaluating Individual Clusters and Objects
        • The Silhouette Coefficient
      • 7.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
        • General Comments on Unsupervised Cluster Evaluation Measures
        • Measuring Cluster Validity via Correlation
        • Judging a Clustering Visually by Its Similarity Matrix
      • 7.5.4 Unsupervised Evaluation of Hierarchical Clustering
      • 7.5.5 Determining the Correct Number of Clusters
      • 7.5.6 Clustering Tendency
      • 7.5.7 Supervised Measures of Cluster Validity
        • Classification-Oriented Measures of Cluster Validity
        • Similarity-Oriented Measures of Cluster Validity
        • Cluster Validity for Hierarchical Clusterings
      • 7.5.8 Assessing the Significance of Cluster Validity Measures
      • 7.5.9 Choosing a Cluster Validity Measure
    • 7.6 Bibliographic Notes
    • Bibliography
    • 7.7 Exercises
  • 8 Cluster Analysis: Additional Issues and Algorithms
    • 8.1 Characteristics of Data, Clusters, and Clustering Algorithms
      • 8.1.1 Example: Comparing K-means and DBSCAN
      • 8.1.2 Data Characteristics
      • 8.1.3 Cluster Characteristics
      • 8.1.4 General Characteristics of Clustering Algorithms
        • Road Map
    • 8.2 Prototype-Based Clustering
      • 8.2.1 Fuzzy Clustering
        • Fuzzy Sets
        • Fuzzy Clusters
        • Fuzzy c-means
          • Computing SSE
          • Initialization
          • Computing Centroids
          • Updating the Fuzzy Pseudo-partition
        • Strengths and Limitations
      • 8.2.2 Clustering Using Mixture Models
        • Mixture Models
        • Estimating Model Parameters Using Maximum Likelihood
        • Estimating Mixture Model Parameters Using Maximum Likelihood: The EM Algorithm
        • Advantages and Limitations of Mixture Model Clustering Using the EM Algorithm
      • 8.2.3 Self-Organizing Maps (SOM)
        • The SOM Algorithm
          • Initialization
          • Selection of an Object
          • Assignment
          • Update
          • Termination
        • Applications
        • Strengths and Limitations
    • 8.3 Density-Based Clustering
      • 8.3.1 Grid-Based Clustering
        • Defining Grid Cells
        • The Density of Grid Cells
        • Forming Clusters from Dense Grid Cells
        • Strengths and Limitations
      • 8.3.2 Subspace Clustering
        • CLIQUE
        • Strengths and Limitations of CLIQUE
      • 8.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
        • Kernel Density Estimation
        • Implementation Issues
        • Strengths and Limitations of DENCLUE
    • 8.4 Graph-Based Clustering
      • 8.4.1 Sparsification
      • 8.4.2 Minimum Spanning Tree (MST) Clustering
      • 8.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS
        • Strengths and Weaknesses
      • 8.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling
        • Deciding Which Clusters to Merge
        • Chameleon Algorithm
          • Sparsification
        • Graph Partitioning
          • Agglomerative Hierarchical Clustering
          • Complexity
        • Strengths and Limitations
      • 8.4.5 Spectral Clustering
        • Relationship between Spectral Clustering and Graph Partitioning
        • Strengths and Limitations
      • 8.4.6 Shared Nearest Neighbor Similarity
        • Problems with Traditional Similarity in High-Dimensional Data
        • Problems with Differences in Density
        • SNN Similarity Computation
        • SNN Similarity versus Direct Similarity
      • 8.4.7 The Jarvis-Patrick Clustering Algorithm
        • Strengths and Limitations
      • 8.4.8 SNN Density
      • 8.4.9 SNN Density-Based Clustering
        • The SNN Density-based Clustering Algorithm
        • Strengths and Limitations
    • 8.5 Scalable Clustering Algorithms
      • 8.5.1 Scalability: General Issues and Approaches
      • 8.5.2 BIRCH
      • 8.5.3 CURE
        • Sampling in CURE
        • Partitioning
    • 8.6 Which Clustering Algorithm?
    • 8.7 Bibliographic Notes
    • Bibliography
    • 8.8 Exercises
  • 9 Anomaly Detection
    • 9.1 Characteristics of Anomaly Detection Problems
      • 9.1.1 A Definition of an Anomaly
      • 9.1.2 Nature of Data
      • 9.1.3 How Anomaly Detection is Used
    • 9.2 Characteristics of Anomaly Detection Methods
    • 9.3 Statistical Approaches
      • 9.3.1 Using Parametric Models
        • Using the Univariate Gaussian Distribution
        • Using the Multivariate Gaussian Distribution
      • 9.3.2 Using Non-parametric Models
      • 9.3.3 Modeling Normal and Anomalous Classes
      • 9.3.4 Assessing Statistical Significance
      • 9.3.5 Strengths and Weaknesses
    • 9.4 Proximity-based Approaches
      • 9.4.1 Distance-based Anomaly Score
      • 9.4.2 Density-based Anomaly Score
      • 9.4.3 Relative Density-based Anomaly Score
      • 9.4.4 Strengths and Weaknesses
    • 9.5 Clustering-based Approaches
      • 9.5.1 Finding Anomalous Clusters
      • 9.5.2 Finding Anomalous Instances
        • Assessing the Extent to Which an Object Belongs to a Cluster
        • Impact of Outliers on the Initial Clustering
        • The Number of Clusters to Use
      • 9.5.3 Strengths and Weaknesses
    • 9.6 Reconstruction-based Approaches
      • 9.6.1 Strengths and Weaknesses
    • 9.7 One-class Classification
      • 9.7.1 Use of Kernels
      • 9.7.2 The Origin Trick
      • 9.7.3 Strengths and Weaknesses
    • 9.8 Information Theoretic Approaches
      • 9.8.1 Strengths and Weaknesses
    • 9.9 Evaluation of Anomaly Detection
    • 9.10 Bibliographic Notes
    • Bibliography
    • 9.11 Exercises
  • 10 Avoiding False Discoveries
    • 10.1 Preliminaries: Statistical Testing
      • 10.1.1 Significance Testing
        • Null Hypothesis
        • Test Statistic
        • Null Distribution
        • Assessing Statistical Significance
      • 10.1.2 Hypothesis Testing
        • Critical Region
        • Type I and Type II Errors
        • Effect Size
      • 10.1.3 Multiple Hypothesis Testing
        • Family-wise Error Rate (FWER)
        • Bonferroni Procedure
        • False discovery rate (FDR)
        • Benjamini-Hochberg Procedure
      • 10.1.4 Pitfalls in Statistical Testing
    • 10.2 Modeling Null and Alternative Distributions
      • 10.2.1 Generating Synthetic Data Sets
      • 10.2.2 Randomizing Class Labels
      • 10.2.3 Resampling Instances
      • 10.2.4 Modeling the Distribution of the Test Statistic
    • 10.3 Statistical Testing for Classification
      • 10.3.1 Evaluating Classification Performance
      • 10.3.2 Binary Classification as Multiple Hypothesis Testing
      • 10.3.3 Multiple Hypothesis Testing in Model Selection
    • 10.4 Statistical Testing for Association Analysis
      • 10.4.1 Using Statistical Models
        • Using Fisher’s Exact Test
        • Using the Chi-Squared Test
      • 10.4.2 Using Randomization Methods
    • 10.5 Statistical Testing for Cluster Analysis
      • 10.5.1 Generating a Null Distribution for Internal Indices
      • 10.5.2 Generating a Null Distribution for External Indices
      • 10.5.3 Enrichment
    • 10.6 Statistical Testing for Anomaly Detection
    • 10.7 Bibliographic Notes
    • Bibliography
    • 10.8 Exercises
  • Author Index
  • Subject Index
  • Copyright Permissions