**2019 | 53 views | 76 Pages | 365.12 KB**

PRIVACY-PRESERVING DATA MINING: MODELS AND ALGORITHMS Edited by CHARU C. AGGARWAL IBM T. J. Watson Research Center, Hawthorne, NY 10532 PHILIP S. YU

PRIVACY PRESERVING DATA MINING

MODELS AND ALGORITHMS

CHARU C AGGARWAL

IBM T J Watson Research Center Hawthorne NY 10532

PHILIP S YU

University of Illinois at Chicago Chicago IL 60607

Kluwer Academic Publishers

Boston Dordrecht London

List of Figures xv

List of Tables xx

Preface xxi

An Introduction to Privacy Preserving Data Mining 1

Charu C Aggarwal Philip S Yu

1 Introduction 1

2 Privacy Preserving Data Mining Algorithms 3

3 Conclusions and Summary 7

References 8

A General Survey of Privacy Preserving Data Mining Models and Algorithms 11

Charu C Aggarwal Philip S Yu

1 Introduction 11

2 The Randomization Method 13

2 1 Privacy Quantification 15

2 2 Adversarial Attacks on Randomization 17

2 3 Randomization Methods for Data Streams 18

2 4 Multiplicative Perturbations 18

2 5 Data Swapping 19

3 Group Based Anonymization 20

3 1 The k Anonymity Framework 20

3 2 Personalized Privacy Preservation 23

3 3 Utility Based Privacy Preservation 24

3 4 Sequential Releases 25

3 5 The l diversity Method 26

3 6 The t closeness Model 27

3 7 Models for Text Binary and String Data 27

4 Distributed Privacy Preserving Data Mining 28

4 1 Distributed Algorithms over Horizontally Partitioned Data

4 2 Distributed Algorithms over Vertically Partitioned Data 31

4 3 Distributed Algorithms for k Anonymity 31

5 Privacy Preservation of Application Results 32

5 1 Association Rule Hiding 32

5 2 Downgrading Classifier Effectiveness 34

vi PRIVACY PRESERVING DATA MINING MODELS AND ALGORITHMS

5 3 Query Auditing and Inference Control 34

6 Limitations of Privacy The Curse of Dimensionality 37

7 Applications of Privacy Preserving Data Mining 38

7 1 Medical Databases The Scrub and Datafly Systems 38

7 2 Bioterrorism Applications 40

7 3 Homeland Security Applications 40

7 4 Genomic Privacy 42

8 Summary 42

References 43

A Survey of Inference Control Methods for Privacy Preserving Data Mining 53

Josep Domingo Ferrer

1 A classification of microdata protection methods 55

2 Perturbative masking methods 58

2 1 Additive noise 58

2 2 Microaggregation 59

2 3 Data swapping and rank swapping 61

2 4 Rounding 62

2 5 Resampling 62

2 6 PRAM 62

2 7 MASSC 63

3 Non perturbative masking methods 63

3 1 Sampling 64

3 2 Global recoding 64

3 3 Top and bottom coding 65

3 4 Local suppression 65

4 Synthetic microdata generation 65

4 1 Synthetic data by multiple imputation 65

4 2 Synthetic data by bootstrap 66

4 3 Synthetic data by Latin Hypercube Sampling 66

4 4 Partially synthetic data by Cholesky decomposition 67

4 5 Other partially synthetic and hybrid microdata approaches 67

4 6 Pros and cons of synthetic microdata 68

5 Trading off information loss and disclosure risk 69

5 1 Score construction 69

5 2 R U maps 71

5 3 k anonymity 71

6 Conclusions and research directions 72

References 73

Measures of Anonymity 81

Suresh Venkatasubramanian

1 Introduction 81

1 1 What is privacy 81

1 2 Data Anonymization Methods 83

1 3 A Classification Of Methods 84

2 Statistical Measures of Anonymity 85

Contents vii

2 1 Query Restriction 85

2 2 Anonymity via Variance 85

2 3 Anonymity via Multiplicity 86

3 Probabilistic Measures of Anonymity 86

3 1 Measures Based on Random Perturbation 87

3 2 Measures Based on Generalization 90

3 3 Utility vs Privacy 93

4 Computational Measures Of Anonymity 94

4 1 Anonymity via Isolation 96

5 Conclusions And New Directions 97

5 1 New Directions 98

References 98

k Anonymous Data Mining A Survey 103

V Ciriani S De Capitani di Vimercati S Foresti and P Samarati

1 Introduction 103

2 k Anonymity 105

3 Algorithms for Enforcing k Anonymity 108

4 k Anonymity Threats from Data Mining 115

4 1 Association Rules 115

4 2 Classification Mining 116

5 k Anonymity in Data Mining 118

6 Anonymize and Mine 120

7 Mine and Anonymize 123

7 1 Enforcing k Anonymity on Association Rules 124

7 2 Enforcing k Anonymity on Decision Trees 127

8 Conclusions 130

Acknowledgments 131

References 131

A Survey of Randomization Methods for Privacy Preserving Data Mining 135

Charu C Aggarwal Philip S Yu

1 Introduction 135

2 Reconstruction Methods for Randomization 137

2 1 The Bayes Reconstruction Method 137

2 2 The EM Reconstruction Method 139

2 3 Utility and Optimality of Randomization Models 141

3 Applications of Randomization 142

3 1 Privacy Preserving Classification with Randomization 142

3 2 Privacy Preserving OLAP 143

3 3 Collaborative Filtering 143

4 The Privacy Information Loss Tradeoff 144

5 Vulnerabilities of the Randomization Method 147

6 Randomization of Time Series Data Streams 149

7 Multiplicative Noise for Randomization 150

7 1 Vulnerabilities of Multiplicative Randomization 151

viii PRIVACY PRESERVING DATA MINING MODELS AND ALGORITHMS

7 2 Sketch Based Randomization 151

8 Conclusions and Summary 152

References 152

A Survey of Multiplicative 155

Perturbation for

Privacy Preserving Data Mining

Keke Chen and Ling Liu

1 Introduction 156

1 1 Data Privacy vs Data Utility 157

1 2 Outline 158

2 Definition of Multiplicative Perturbation 159

2 1 Notations 159

2 2 Rotation Perturbation 159

2 3 Projection Perturbation 160

2 4 Sketch based Approach 162

2 5 Geometric Perturbation 162

3 Transformation Invariant Data Mining Models 163

3 1 Definition of Transformation Invariant Models 163

3 2 Transformation Invariant Classification Models 164

3 3 Transformation Invariant Clustering Models 165

4 Privacy Evaluation for Multiplicative Perturbation 166

4 1 A Conceptual Multidimensional Privacy Evaluation Model 166

4 2 Variance of Difference as Column Privacy Metric 167

4 3 Incorporating Attack Evaluation 168

4 4 Other Metrics 168

5 Attack Resilient Multiplicative Perturbations 169

5 1 Naive Estimation to Rotation Perturbation 169

5 2 ICA Based Attacks 171

5 3 Distance Inference Attacks 172

5 4 Attacks with More Prior Knowledge 174

5 5 Finding Attack Resilient Perturbations 174

6 Conclusion 176

References 176

A Survey of Quantification of Privacy Preserving Data Mining Algorithms 181

Elisa Bertino and Dan Lin and Wei Jiang

1 Metrics for Quantifying Privacy Level 184

1 1 Data Privacy 184

1 2 Result Privacy 189

2 Metrics for Quantifying Hiding Failure 190

3 Metrics for Quantifying Data Quality 191

3 1 Quality of the Data Resulting from the PPDM Process 191

3 2 Quality of the Data Mining Results 196

4 Complexity Metrics 198

5 How to Select a Proper Metric 199

6 Conclusion and Research Directions 200

References 200

Contents ix

A Survey of Utility based 205

Privacy Preserving Data

Transformation Methods

Ming Hua and Jian Pei

1 Introduction 206

1 1 What is Utility based Privacy Preservation 207

2 Types of Utility based Privacy Preservation Methods 208

2 1 Privacy Models 208

2 2 Utility Measures 210

2 3 Summary of the Utility Based Privacy Preserving Methods 212

3 Utility Based Anonymization Using Local Recoding 212

3 1 Global Recoding and Local Recoding 213

3 2 Utility Measure 214

3 3 Anonymization Methods 215

3 4 Summary and Discussion 217

4 The Utility based Privacy Preserving Methods in Classification Prob

4 1 The Top Down Specialization Method 218

4 2 The Progressive Disclosure Algorithm 222

4 3 Summary and Discussion 226

5 Anonymized Marginal Injecting Utility into Anonymized Data Sets 226

5 1 Anonymized Marginal 227

5 2 Utility Measure 228

5 3 Injecting Utility Using Anonymized Marginals 229

5 4 Summary and Discussion 231

6 Summary 232

References 232

Mining Association Rules under Privacy Constraints 237

Jayant R Haritsa

1 Problem Framework 238

2 Evolution of the Literature 244

3 The FRAPP Framework 249

4 Sample Results 257

5 Closing Remarks 261

References 261

A Survey of Association Rule Hiding Methods for Privacy 265

Vassilios S Verykios and Aris Gkoulalas Divanis

1 Introduction 265

2 Terminology and Preliminaries 267

3 Taxonomy of Association Rule Hiding Algorithms 268

4 Classes of Association Rule Algorithms 269

4 1 Heuristic Approaches 270

4 2 Border based Approaches 275

4 3 Exact Approaches 276

x PRIVACY PRESERVING DATA MINING MODELS AND ALGORITHMS

5 Other Hiding Approaches 277

6 Metrics and Performance Analysis 279

7 Discussion and Future Trends 282

8 Conclusions 283

References 284

A Survey of Statistical 289

Approaches to Preserving

Confidentiality of Contingency

Table Entries

Stephen E Fienberg and Aleksandra B Slavkovic

1 Introduction 289

2 The Statistical Approach Privacy Protection 290

3 Datamining Algorithms Association Rules and Disclosure Limita

4 Estimation and Disclosure Limitation for Multi way Contingency

Tables 293

5 Two Illustrative Examples 299

5 1 Example 1 Data from a Randomized Clinical Trial 299

5 2 Example 2 Data from the 1993 U S Current Population

Survey 303

6 Conclusions 306

References 307

A Survey of 311

Privacy Preserving Methods

Across Horizontally Partitioned

Murat Kantarcioglu

1 Introduction 311

2 Basic Cryptographic Techniques for Privacy Preserving Distributed

Data Mining 313

3 Common Secure Sub protocols Used in Privacy Preserving Distributed

Data Mining 316

4 Privacy preserving Distributed Data Mining on Horizontally Parti

tioned Data 321

5 Comparison to Vertically Partitioned Data Model 324

6 Extension to Malicious Parties 325

7 Limitations of the Cryptographic Techniques Used in Privacy Preserving

Distributed Data Mining 327

8 Privacy Issues Related to Data Mining Results 328

9 Conclusion 330

References 330

A Survey of 335

Privacy Preserving Methods

across Vertically Partitioned

Contents xi

Jaideep Vaidya

1 Classification 337

1 1 Na ve Bayes Classification 340

1 2 Bayesian Network Structure Learning 341

1 3 Decision Tree Classification 342

2 Clustering 344

3 Association Rule Mining 345

4 Outlier detection 347

4 1 Algorithm 349

4 2 Security Analysis 350

4 3 Computation and Communication Analysis 352

5 Challenges and Research Directions 353

References 354

A Survey of Attack Techniques on Privacy Preserving Data Perturbation 357

Kun Liu Chris Giannella and Hillol Kargupta

1 Introduction 358

2 Definitions and Notation 358

3 Attacking Additive Data Perturbation 359

3 1 Eigen Analysis and PCA Preliminaries 360

3 2 Spectral Filtering 361

3 3 SVD Filtering 362

3 4 PCA Filtering 363

3 5 MAP Estimation Attack 364

3 6 Distribution Analysis Attack 365

3 7 Summary 366

4 Attacking Matrix Multiplicative Data Perturbation 367

4 1 Known I O Attacks 368

4 2 Known Sample Attack 371

4 3 Other Attacks Based on ICA 372

4 4 Summary 373

5 Attacking k Anonymization 374

6 Conclusion 374

Acknowledgments 375

References 375

Private Data Analysis via 381

Output Perturbation

Kobbi Nissim

1 The Abstract Model Statistical Databases Queries and Sanitizers 383

2 Privacy 386

2 1 Interpreting the Privacy Definition 388

3 The Basic Technique Calibrating Noise to Sensitivity 392

3 1 Applications Functions with Low Global Sensitivity 394

4 Constructing Sanitizers for Complex Functionalities 398

4 1 k Means Clustering 399

xii PRIVACY PRESERVING DATA MINING MODELS AND ALGORITHMS

4 2 SVD and PCA 401

4 3 Learning in the Statistical Queries Model 401

5 Beyond the Basics 403

5 1 Instance Based Noise and Smooth Sensitivity 403

5 2 The Sample Aggregate Framework 405

5 3 A General Sanitization Mechanism 406

6 Related Work and Bibliographic Notes 407

References 408

A Survey of Query Auditing Techniques for Data Privacy 413

Shubha U Nabar Krishnaram Kenthapadi Nina Mishra and Rajeev Motwani

1 Auditing Aggregate Queries 414

1 1 Offline Auditing 415

1 2 Online Auditing 416

2 Auditing Select Project Join Queries 424

3 Challenges in Auditing 425

4 Reading 427

References 428

Privacy and the Dimensionality Curse 431

Charu C Aggarwal

1 Introduction 431

2 The Dimensionality Curse and the k anonymity Method 434

3 The Dimensionality Curse and Condensation 439

4 The Dimensionality Curse and the Randomization Method 444

4 1 Effects of Public Information 445

4 2 Effects of High Dimensionality 448

4 3 Gaussian Perturbing Distribution 448

4 4 Uniform Perturbing Distribution 454

5 The Dimensionality Curse and l diversity 457

6 Conclusions and Research Directions 458

References 458

Personalized Privacy Preservation 461

Yufei Tao and Xiaokui Xiao

1 Introduction 461

2 Formalization of Personalized Anonymity 464

2 1 Personal Privacy Requirements 464

2 2 Generalization 465

3 Combinatorial Process of Privacy Attack 466

3 1 Primary Case 468

3 2 Non primary Case 469

4 Theoretical Foundation 470

4 1 Notations and Basic Properties 471

4 2 Derivation of the Breach Probability 472

5 Generalization Algorithm 473

Contents xiii

5 1 The Greedy Framework 474

5 2 Optimal SA generalization 476

6 Alternative Forms of Personalized Privacy Preservation 478

6 1 Extension of k anonymity 479

6 2 Personalization in Location Privacy Protection 480

7 Summary and Future Work 482

References 484

Privacy Preserving Data Stream Classification 487

Yabo Xu Ke Wang Ada Wai Chee Fu Rong She and Jian Pei

1 Introduction 487

1 1 Motivating Example 488

1 2 Contributions and Paper Outline 490

2 Related Works 492

3 Problem Statement 493

3 1 Secure Join Stream Classification 493

3 2 Naive Bayesian Classifiers 494

4 Our Approach 495

4 1 Initialization 496

4 2 Bottom Up Propagation 496

4 3 Top Down Propagation 498

4 4 Using NBC 499

4 5 Algorithm Analysis 500

5 Empirical Studies 501

5 1 Real life Datasets 502

5 2 Synthetic Datasets 504

5 3 Discussion 506

6 Conclusions 507

References 508

List of Figures

5 1 Simplified representation of a private table 106

5 2 An example of domain and value generalization hierarchies 107

5 3 Classification of k anonymity techniques 11 108

5 4 Generalization hierarchy for QI Marital status Sex 110

5 5 Index assignment to attributes Marital status and Sex 110

5 6 An example of set enumeration tree over set I 1 2 3

of indexes 111

5 7 Sub hierarchies computed by Incognito for the table in

Figure 5 1 112

5 8 Spatial representation a and possible partitioning b

d of the table in Figure 5 1 113

5 9 An example of decision tree 117

5 10 Different approaches for combining k anonymity and

data mining 118

5 11 An example of top down anonymization for the private

table in Figure 5 1 123

5 12 Frequent itemsets extracted from the table in Figure 5 1 124

5 13 An example of binary table 125

5 14 Itemsets extracted from the table in Figure 5 13 b 126

5 15 Itemsets with support at least equal to 40 a and corre

sponding anonymized itemsets b 126

5 16 3 anonymous version of the tree of Figure 5 9 128

5 17 Suppression of occurrences in non leaf nodes in the tree

in Figure 5 9 130

5 18 Table inferred from the decision tree in Figure 5 17 130

5 19 11 anonymous version of the tree in Figure 5 17 130

5 20 Table inferred from the decision tree in Figure 5 19 131

6 1 Illustration of the Information Loss Metric 147

7 1 Using known points and distance relationship to infer the

rotation matrix 173

xvi PRIVACY PRESERVING DATA MINING MODELS AND ALGORITHMS

9 1 A taxonomy tree on categorical attribute Education 219

9 2 A taxonomy tree on continuous attribute Age 219

9 3 Interactive graph 230

9 4 A decomposition 230

10 1 CENSUS 19 259

10 2 Perturbation Matrix Condition Numbers 19 260

13 1 Relationship between Secure Sub protocols and Privacy

Preserving Distributed Data Mining on Horizontally Par

titioned Data 321

14 1 Two dimensional problem that cannot be decomposed

into two one dimensional problems 339

15 1 Wigner s semi circle law a histogram of the eigenvalues

for a large randomly generated A 361

17 1 Skeleton of a simulatable private randomized auditor 421

18 1 Some Examples of Generalization for 2 Anonymity 433

18 2 Upper Bound of 2 anonymity Probability in an Non

Empty Grid Cell 433

18 3 Fraction of Data Points Preserving 2 Anonymity with

Data Dimensionality Gaussian Clusters 438

18 4 Minimum Information Loss for 2 Anonymity Gaussian

Clusters 444

18 5 Randomization Level with Increasing Dimensionality

Perturbation level 8 o U niDis 456

19 1 Microdata and generalization 463

19 2 The taxonomy of attribute Disease 463

19 3 A possible result of our generalization scheme 467

19 4 The voter registration list 468

19 5 Algorithm for computing personalized generalization 474

19 6 Algorithm for finding the optimal SA generalization 478

19 7 Personalized k anonymous generalization 480

20 1 Related streams tables 489

20 2 The join stream 489

20 3 Example with 3 streams at initialization 496

20 4 After bottom up propagations 498

20 5 After top down propagations 499

20 6 UK road accident data 2001 502

20 7 Classifier accuracy 502