
Applied Data Science

Ian Langmore
Daniel Krasner


Contents

I  Programming Prerequisites

1 Unix
  1.1 History and Culture
  1.2 The Shell
  1.3 Streams
    1.3.1 Standard streams
    1.3.2 Pipes
  1.4 Text
  1.5 Philosophy
    1.5.1 In a nutshell
    1.5.2 More nuts and bolts
  1.6 End Notes

2 Version Control with Git
  2.1 Background
  2.2 What is Git
  2.3 Setting Up
  2.4 Online Materials
  2.5 Basic Git Concepts
  2.6 Common Git Workflows
    2.6.1 Linear Move from Working to Remote
    2.6.2 Discarding changes in your working copy
    2.6.3 Erasing changes
    2.6.4 Remotes
    2.6.5 Merge conflicts

3 Building a Data Cleaning Pipeline with Python
  3.1 Simple Shell Scripts
  3.2 Template for a Python CLI Utility

II  The Classic Regression Models

4 Notation
  4.1 Notation for Structured Data

5 Linear Regression
  5.1 Introduction
  5.2 Coefficient Estimation: Bayesian Formulation
    5.2.1 Generic setup
    5.2.2 Ideal Gaussian World
  5.3 Coefficient Estimation: Optimization Formulation
    5.3.1 The least squares problem and the singular value decomposition
    5.3.2 Overfitting examples
    5.3.3 L2 regularization
    5.3.4 Choosing the regularization parameter
    5.3.5 Numerical techniques
  5.4 Variable Scaling and Transformations
    5.4.1 Simple variable scaling
    5.4.2 Linear transformations of variables
    5.4.3 Nonlinear transformations and segmentation
  5.5 Error Metrics
  5.6 End Notes

6 Logistic Regression
  6.1 Formulation
    6.1.1 Presenter's viewpoint
    6.1.2 Classical viewpoint
    6.1.3 Data generating viewpoint
  6.2 Determining the regression coefficient w
  6.3 Multinomial logistic regression
  6.4 Logistic regression for classification
  6.5 L1 regularization
  6.6 Numerical solution
    6.6.1 Gradient descent
    6.6.2 Newton's method
    6.6.3 Solving the L1 regularized problem
    6.6.4 Common numerical issues
  6.7 Model evaluation
  6.8 End Notes

7 Models Behaving Well
  7.1 End Notes

III  Text Data

8 Processing Text
  8.1 A Quick Introduction
  8.2 Regular Expressions
    8.2.1 Basic Concepts
    8.2.2 Unix Command line and regular expressions
    8.2.3 Finite State Automata and PCRE
    8.2.4 Backreference
  8.3 Python RE Module
  8.4 The Python NLTK Library
    8.4.1 The NLTK Corpus and Some Fun things to do

IV  Classification

9 Classification
  9.1 A Quick Introduction
  9.2 Naive Bayes
    9.2.1 Smoothing
  9.3 Measuring Accuracy
    9.3.1 Error metrics and ROC Curves
  9.4 Other classifiers
    9.4.1 Decision Trees
    9.4.2 Random Forest
    9.4.3 Out-of-bag classification
    9.4.4 Maximum Entropy

V  Extras

10 High(er) performance Python
  10.1 Memory hierarchy
  10.2 Parallelism
  10.3 Practical performance in Python
    10.3.1 Profiling
    10.3.2 Standard Python rules of thumb
    10.3.3 For loops versus BLAS
    10.3.4 Multiprocessing Pools
    10.3.5 Multiprocessing example: Stream processing text files
    10.3.6 Numba
    10.3.7 Cython

What is data science? With the major technological advances of the last two decades, coupled in part with the internet explosion, a new breed of analyst has emerged. The exact role, background, and skill set of a data scientist are still in the process of being defined, and it is likely that by the time you read this some of what we say will seem archaic.

In very general terms, we view a data scientist as an individual who uses current computational techniques to analyze data. Now you might make the observation that there is nothing particularly novel in this, and subsequently ask what has forced the definition.¹ After all, statisticians, physicists, biologists, finance quants, etc. have been looking at data since their respective fields emerged. One short answer comes from the fact that the data sphere has changed and, hence, a new set of skills is required to navigate it effectively. The exponential increase in computational power has provided new means to investigate the ever growing amount of data being collected every second of the day. What this implies is that any modern data analyst will have to make the time investment to learn the computational techniques necessary to deal with the volumes and complexity of the data of today. In addition to those of mathematics and statistics, these software skills are domain transferable, and so it makes sense to create a job title that is also transferable. We could also point to the "data hype" created in industry as a culprit for the term data science, with the science creating an aura of validity and facilitating LinkedIn headhunting.

What skills are needed? One neat way we like to visualize the data science skill set is with Drew Conway's Venn Diagram [Con]; see figure 1.

Figure 1: Drew Conway's Venn Diagram

Math and statistics is what allows us to properly quantify a phenomenon observed in data. For the sake of narrative, let's take a complex deterministic situation, such as whether or not someone will make a loan payment, and attempt to answer this question with a limited number of variables and an imperfect understanding of those variables' influence on the event we wish to predict. With the exception of your friendly real estate agent, we generally acknowledge our lack of soothsayer ability and make statements about the probability of this event. These statements take a mathematical form, for example

    P[makes-loan-payment] ∼ e^{α + β·creditscore},

where the above quantifies the risk associated with this event. Deciding on the best coefficients α and β can be done quite easily by a host of software packages. In fact, anyone with decent hacking skills can achieve the goal. Of course, a simple model such as this would convince no one and would call for substantive expertise (more commonly called domain knowledge) to make real progress. In this case, a domain expert would note that additional variables such as the loan to value ratio and housing price index are needed, as they have a huge effect on payment activity. These variables and many others would allow us to arrive at a "better" model

    P[makes-loan-payment] ∼ e^{α + β·X}.    (1)

Finally we have arrived at a model capable of fooling someone! We could keep adding variables until the model will almost certainly fit the historic risk quite well. BUT, how do we know that this will allow us to quantify risk in the future? To make some sense of our uncertainty² about our model we need to know exactly what (1) means. In particular, did we include too many variables and overfit? Did our method of solving (1) arrive at a good solution or just numerical noise? Most importantly, how appropriate is the logistic regression model to begin with? Answering these questions is often as much an art as a science, but in our experience, sufficient mathematical understanding is necessary to avoid getting lost.

¹ William S. Cleveland decided to coin the term data science and wrote Data Science: An action plan for expanding the technical areas of the field of statistics [Cle]. His report outlined six points for a university to follow in developing a data analyst curriculum.

² The distinction between uncertainty and risk has been talked about quite extensively by Nassim Taleb [Tal05, Tal10].
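As noted above, estimating the coefficients α and β is the easy part. The following Python sketch (not taken from the text) fits a logistic regression, one concrete reading of model (1), to simulated loan data with scikit-learn; the feature name credit_score, the sample size, and the "true" coefficients are all invented for illustration.

# Minimal, hypothetical sketch: recover alpha and beta for a
# logistic-regression version of the loan-payment model above,
# using simulated (not real) data and scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
credit_score = rng.normal(650, 75, size=1000)   # hypothetical credit scores
logit = -8.0 + 0.015 * credit_score             # assumed "true" alpha, beta
paid = rng.random(1000) < 1.0 / (1.0 + np.exp(-logit))

model = LogisticRegression(max_iter=1000)
model.fit(credit_score.reshape(-1, 1), paid)
print("alpha:", model.intercept_[0])
print("beta: ", model.coef_[0, 0])

Whether the recovered coefficients mean anything, given overfitting, numerical noise, or an inappropriate model, is exactly the harder question raised above.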

What is the motivation for, and focus of, this course? Just as common as the hacker with no domain knowledge, or the domain expert with no statistical know-how, is the traditional academic with meager computing skills. Academia rewards papers containing original theory. For the most part it does not reward the considerable effort needed to produce high quality, maintainable code that can be used by others and integrated into larger frameworks. As a result, the type of code typically put forward by academics is completely unusable in industry, or by anyone else for that matter. It is often not the purpose, or worth the effort, to write production level code in an academic environment. The importance of this cannot be overstated. Consider a 20 person start-up that wishes to build a smart-phone app that recommends restaurants to users. The data scientist hired for this job will need to interact with the company database (they will likely not be handed a neat csv file), deal with falsely entered or inconveniently formatted data, and produce legible reports, as well as a working model for the rest of the company to integrate into its production framework. The scientist may be expected to do this work without much in the way of software support. Now, considering how easy it is to blindly run most predictive software, our hypothetical company will be tempted to use a programmer with no statistical knowledge to do this task. Of course, the programmer will fall into analytic traps such as the ones mentioned above, but that might not deter anyone from being content with the output. This anecdote seems contrived, but in reality it is something we have seen time and time again. The current world of data analysis calls for a myriad of skills, and clean programming, database interaction, and an understanding of architecture have all become the minimum to succeed.

The purpose of this course is to take people with strong mathematical/statistical knowledge and teach them software development fundamentals.³ This course will cover

• Design of small software packages
• Working in a Unix environment
• Designing software in teams
• Fundamental statistical algorithms such as linear and logistic regression
• Overfitting and how to avoid it
• Working with text data (e.g. regular expressions)
• Time series
• And more. . .

³ Our view of what constitutes the necessary fundamentals is strongly influenced by the team at software carpentry [Wila].

Part I

Programming Prerequisites

Chapter 1

Unix

Simplicity is the key to brilliance.
– Bruce Lee

1.1 History and Culture

The Unix operating system was developed in 1969 at AT&T's Bell Labs. Today Unix lives on through its open source offspring, Linux. This operating system is the dominant force in scientific computing, supercomputing, and web servers. In addition, Mac OS X (which is Unix based) and a variety of user friendly Linux operating systems represent a significant portion of the personal computer market. To understand the reasons for this success, some history is needed.

In the 1960s, MIT, AT&T Bell Labs, and General Electric developed a time-sharing (meaning different users could share one system) operating system called Multics. Multics was found to be too complicated. This "failure" led researchers to develop a new operating system that focused on simplicity. This operating system emphasized ease of communication among many simple programs. Kernighan and Pike summarized this as "the idea that the power of a system comes more from the relationships among programs than from the programs themselves."

The Unix community was integrated with the Internet and networked computing from the beginning. This, along with the solid fundamental design, could have led to Unix becoming the dominant computing paradigm during the 1980s personal computer revolution. Unfortunately, infighting and poor business decisions kept Unix out of the mainstream.

Unix found a second life, not so much through better business decisions, but through the efforts of Richard Stallman and the GNU Project. The goal was to produce a Unix-like operating system that depended only on free software. Free in this case meant "users are free to run the software, share it, study it, and modify it." The GNU Project succeeded in creating a huge suite of utilities for use with an operating system (e.g. a C compiler) but was lacking the kernel (which handles communication between e.g. hardware and software, or among processes). It just so happened that Linus Torvalds had developed a kernel (the "Linux" kernel) in need of good utilities. Together the Linux operating system was born.

1.2 The Shell

Figure 1.1: Ubuntu's GUI and CLI

Modern Linux distributions, such as Ubuntu, come with a graphical user interface (GUI) every bit as slick as Windows or Mac OS X. Software is easy to install and, with at most a tiny bit of work, all non-proprietary applications work fine. The real power of Unix is realized when you start using the shell.

Digression 1: Linux without tears

The easiest way to have access to the bash shell and a modern scientific computing environment is to buy hardware that is pre-loaded with Linux. This way, the har
