An Introduction to S and The Hmisc and Design Libraries


Carlos Alzola, MS
Statistical Consultant
401 Glyndon Street SE
Vienna, Va 22180
calzola@cox.net

Frank Harrell, PhD
Professor of Biostatistics
Department of Biostatistics
Vanderbilt University School of Medicine
S-2323 Medical Center North
Nashville, Tn 37232
f.harrell@vanderbilt.edu
http://biostat.mc.vanderbilt.edu/RS

September 24, 2006


Updates to this document may be obtained from biostat.mc.vanderbilt.edu/RS/sintro.pdf.

Contents

1 Introduction
  1.1 S, S-Plus, R, and Source References
    1.1.1 R
  1.2 Starting S
    1.2.1 UNIX/Linux
    1.2.2 Windows
  1.3 Commands vs. GUIs
  1.4 Basic S Commands
  1.5 Methods for Entering and Saving S Commands
    1.5.1 Specifying System File Names in S
  1.6 Differences Between S and SAS
  1.7 A Comparison of UNIX/Linux and Windows for Running S
  1.8 System Requirements
  1.9 Some Useful System Tools

2 Objects, Getting Help, Functions, Attributes, and Libraries
  2.1 Objects
  2.2 Getting Help
  2.3 Functions
  2.4 Vectors
    2.4.1 Numeric, Character and Logical Vectors
    2.4.2 Missing Values and Logical Comparisons
    2.4.3 Subscripts and Index Vectors
  2.5 Matrices, Lists and Data Frames
    2.5.1 Matrices
    2.5.2 Lists
    2.5.3 Data Frames
  2.6 Attributes
    2.6.1 The Class Attribute and Factor Objects
    2.6.2 Summary of Basic Object Types
  2.7 When to Quote Constants and Object Names
  2.8 Function Libraries
  2.9 The Hmisc Library
  2.10 Installing Add–on Libraries
  2.11 Accessing Add–On Libraries Automatically

3 Data in S
  3.1 Importing Data
  3.2 Reading Data into S
    3.2.1 Reading Raw Data
    3.2.2 Reading S-Plus Data into R
    3.2.3 Reading SAS Datasets
    3.2.4 Handling Date Variables in R
  3.3 Displaying Metadata
  3.4 Adjustments to Variables after Input
  3.5 Writing Out Data
    3.5.1 Writing ASCII files
    3.5.2 Transporting S Data
    3.5.3 Customized Printing
    3.5.4 Sending Output to a File
  3.6 Using the Hmisc Library to Inspect Data

4 Operating in S
  4.1 Reading and Writing Data Frames and Variables
    4.1.1 The attach and detach Functions
    4.1.2 Subsetting Data Frames
    4.1.3 Adding Variables to a Data Frame without Attaching
    4.1.4 Deleting Variables from a Data Frame
    4.1.5 A Better Approach to Changing Data Frames: upData
    4.1.6 assign and store
  4.2 Managing Project Data in R
    4.2.1 Accessing Remote Objects and Different Objects with the Same Names
    4.2.2 Documenting Data Frames
    4.2.3 Accessing Data in Windows S-Plus
  4.3 Miscellaneous Functions
    4.3.1 Functions for Sorting
    4.3.2 By Processing
    4.3.3 Sending Multiple Variables to Functions Expecting only One
    4.3.4 Functions for Data Manipulation and Management
    4.3.5 Merging Data Frames
    4.3.6 Merging Baseline Data with One–Number Summaries of Follow–up Data
    4.3.7 Constructing More Complex Summaries of Follow-up Data
    4.3.8 Subsetting a Data Frame by Examining Repeated Measurements
    4.3.9 Converting Between Matrices and Vectors: Re–shaping Serial Data
    4.3.10 Computing Changes in Serial Observations
  4.4 Recoding Variables and Creating Derived Variables
    4.4.1 The score.binary Function
    4.4.2 The recode Function
    4.4.3 Should Derived Variables be Stored Permanently?
  4.5 Review of Data Frame Creation, Annotation, and Analysis
  4.6 Dealing with Many Data Frames Simultaneously
  4.7 Missing Value Imputation using Hmisc
  4.8 Using S for Simulations and Bootstrapping

5 Probability and Statistical Functions
  5.1 Basic Functions for Statistical Summaries
  5.2 Functions for Probability Distributions
  5.3 Hmisc Functions for Power and Sample Size Calculations
  5.4 Statistical Tests
    5.4.1 Nonparametric Tests
    5.4.2 Parametric Tests

6 Making Tables
  6.1 S-Plus–supplied Functions
  6.2 The Hmisc summary.formula Function
    6.2.1 Implementing Other Interfaces
  6.3 Graphical Depiction of Two–Way Contingency Tables

7 Hmisc Generalized Least Squares Modeling Functions
  7.1 Automatically Transforming Predictor and Response Variables
  7.2 Robust Serial Data Models: Time– and Dose–Response Profiles
    7.2.1 Example

8 Builtin S Functions for Multiple Linear Regression
  8.1 Sequential and Partial Sums of Squares and F–tests

9 The Design Library of Modeling Functions
  9.1 Statistical Formulas in S
  9.2 Purposes and Capabilities of Design
    9.2.1 Differences Between lm (Builtin) and Design’s ols Function
  9.3 Examples of the Use of Design
    9.3.1 Examples with Graphical Output
    9.3.2 Binary Logistic Modeling with the Prostate Data Frame
    9.3.3 Troubleshooting Problems with factor Predictors
    9.3.4 A Comprehensive Hypothetical Example
    9.3.5 Using Design and Interactive Graphics to Generate Flexible Functions
  9.4 Checklist of Problems to Avoid When Using Design
  9.5 Describing Representation of Subjects

10 Principles of Graph Construction
  10.1 Graphical Perception
  10.2 General Suggestions
  10.3 Tufte on “Chartjunk”
  10.4 Tufte’s Views on Graphical Excellence
  10.5 Formatting
  10.6 Color, Symbols, and Line Styles
  10.7 Scaling
  10.8 Displaying Estimates Stratified by Categories
  10.9 Displaying Distribution Characteristics
  10.10 Showing Differences
  10.11 Choosing the Best Graph Type
    10.11.1 Single Categorical Variable
    10.11.2 Single Continuous Numeric Variable
    10.11.3 Categorical Response Variable vs. Categorical Ind. Var.
    10.11.4 Categorical Response vs. a Continuous Ind. Var.
    10.11.5 Continuous Response Variable vs. Categorical Ind. Var.
    10.11.6 Continuous Response vs. Continuous Ind. Var.
  10.12 Conditioning Variables

11 Graphics in S
  11.1 Overview
  11.2 Adding Text or Legends and Identifying Observations
  11.3 Hmisc and Design High–Level Plotting Functions
  11.4 trellis Graphics
    11.4.1 Multiple Response Variables and Error Bars
    11.4.2 Multiple x–axis Variables and Error Bars in Dot Plots
    11.4.3 Using summarize with trellis
    11.4.4 A Summary of Functions for Aggregating Data for Plotting

12 Controlling Graphics Details
  12.1 Graphics Parameters
    12.1.1 The Graphics Region
    12.1.2 Controlling Text and Margins
    12.1.3 Controlling Plotting Symbols
    12.1.4 Multiple Plots
    12.1.5 Skipping Over Plots
    12.1.6 A More Flexible Layout
    12.1.7 Controlling Axes
    12.1.8 Overlaying Figures
  12.2 Specifying a Graphical Output Device
    12.2.1 Opening Graphics Windows
    12.2.2 The postscript, ps.slide, setps, setpdf Functions
    12.2.3 The win.slide and gs.slide Functions
    12.2.4 Inserting S Graphics into Microsoft Office Documents

13 Managing Batch Analyses, and Writing Your Own Functions
  13.1 Using S in Batch Mode
    13.1.1 Batch Jobs in UNIX
    13.1.2 Batch Jobs in Windows
  13.2 Managing S Non-Interactive Programs
  13.3 Reproducible Analysis
  13.4 Reproducible Reports
  13.5 Writing Your Own Functions
    13.5.1 Some Programming Commands
    13.5.2 Creating a New Function
  13.6 Customizing Your Environment

Index

List of Tables

1.1 Comparisons of SAS and S
1.2 SAS Procedures and Corresponding S Functions
2.1 Comparison of Some S Objects
4.1 Functions for Sorting
4.2 Functions for Data Manipulation and Management
5.1 Functions for Statistical Summaries
5.2 Probability Distribution Functions
5.3 Hmisc Functions for Power/Sample Size
5.4 S Functions for Statistical Tests
6.1 Descriptive Statistics by Treatment
9.1 Operators in Formulae
9.2 Special fitting functions
9.3 Functions for transforming predictor variables in models
9.4 Generic Functions and Methods
9.5 Generic Functions and Methods
11.1 Non–trellis High Level Plotting Functions
12.1 Low Level Plotting Functions

List of Figures

5.1 Characteristics of control and intervention groups
6.1 A two–way contingency table
7.1 Transformations estimated by avas
7.2 Distribution of residuals from avas fit
7.3 avas transformation vs. reciprocal
7.4 Predicted median glyhb as a function of age and chol.
7.5 Nonparametric estimates of time trends for individual subjects
7.6 Bootstrap estimates of time trends
7.7 Simultaneous and pointwise bootstrap confidence regions
9.1 Cholesterol interacting with categorized age
9.2 Restricted cubic spline surface in two variables, each with k = 4 knots
9.3 Fit with age × spline(cholesterol) and cholesterol × spline(age)
9.4 Spline fit with simple product interaction
9.5 Predictions from linear interaction model with mean age in tertiles indicated
9.6 Summary of model using odds ratios and inter–quartile–range odds ratios
9.7 Cox PH model stratified on sex, with interaction between age spline and sex
9.8 Nomogram from fitted Cox model
9.9 Nomogram from fitted Cox model
10.1 Error bars for individual means and differences
11.1 Basic Plot
11.2 Basic Plot with Labels and Title
11.3 Plotting a Factor
11.4 Example of Boxplot
11.5 Example of Plot on a Fitted Model
11.6 Overriding datadist Values
11.7 Example of Co-Plot
11.8 Identifying Observations
11.9 datadensity plot for the prostate data frame
11.10 Box–percentile plot
11.11 Extended box plot for titanic data. Shown are the median, mean (solid dot), and quantile intervals containing 0.25, 0.5, 0.75, and 0.9 of the age distribution.
11.12 Multi–panel trellis graph produced by the Hmisc ecdf function.
12.1 Plot Region
12.2 Text in margins
12.3 Plotting Symbols
12.4 Different Types of Lines
12.5 Flexible layout using mfg
12.6 Controlling Axis Labels Style
12.7 Examples of tick marks
12.8 Use of axis
12.9 Overlaying high-level plots
12.10 Example of subplot
12.11 Another subplot example

Chapter 1

Introduction

1.1 S, S-Plus, R, and Source References

S-Plus and R are supersets of the S language [1], an interactive programming environment for data analysis and graphics. Insightful Corporation in Seattle took the AT&T Bell Labs S code and enhanced it, producing many new statistical functions and graphical interfaces. In this text we use S to refer to both the S-Plus and R languages.

S is a unique combination of a powerful language and flexible, high-quality graphics functions. What is most important about S is that it was designed to be extendable. Insightful, AT&T (now Lucent Technologies), and a large community of S-Plus users and R developers and users are constantly adding new capabilities to the system, all using the same high-level language. S allows users to take advantage of an explosion of powerful new data analysis and statistical modeling techniques.

The richness of the S language and its planned extendability allow users to perform comprehensive analyses and data explorations with a minimum of programming. As an example, S functions in the Design library (see Chapter 9) can perform analyses and make graphical representations that would take pages of programming in other systems, if they could be done at all:

# Fit binary logistic model without assuming linearity for age or
# equal shapes of the age relationship for the two sexes
# Represent age using a restricted cubic spline function with 4 knots
# This requires 3 age parameters per sex. Model has intercept + 6 coefficients.
# x=T, y=T causes design matrix and response vector to be stored in
# the fit object f. This allows certain residuals to be computed later, and
# it allows the original data to be re-analyzed later (e.g., bootstrapping
# and cross-validation)
f <- lrm(death ~ rcs(age,4)*sex, x=T, y=T)

# Test for age*sex interaction (3 d.f.), linearity in age (4 d.f.),
# overall age effect (6 d.f.), overall sex effect (4 d.f.),
# linearity of age interaction with sex (2 d.f.)
anova(f)

# Compute the 60:40 year odds ratio for females
summary(f, age=c(40,60), sex='female')

# Plot the age effects separately by sex, with confidence bands
plot(f, age=NA, sex=NA)

# Validate the model using the bootstrap - check for overfitting
validate(f)

# Draw a nomogram depicting the model, adding an axis for the
# predicted probability of death
nomogram(f, fun=plogis, funlabel='Prob(death)')

# Get predicted log odds of death for 40 year old male
predict(f, data.frame(age=40,sex='male'))

# Make a new S-Plus function which analytically computes predicted
# values from the fitted model
g <- Function(f)

# Use this function to duplicate the above prediction for 40 year old male
g(age=40, sex='male')

[1] S, which may stand for statistics, was developed by the same lab that developed the C language.

By making a high-level language the cornerstone of S, you could say that S is designed to be inefficient for some applications from a pure CPU time point of view. However, computer time is inexpensive in comparison with personnel time, and analysts who have learned S can be much more productive in doing data analyses. They can usually do more complex and exploratory analyses in the same time that standard analyses take using other systems.

In its simplest use, S is an interactive calculator. Commands are executed (or debugged) as they are entered. The S language is based on the use of functions to perform calculations, open graphics windows, set system options, and even to exit the system. Variables can refer to single-valued scalars, vectors, matrices, or other forms. Ordinarily a variable is stored as a vector, e.g., age will refer to all the ages of subjects in a dataset. Perhaps the biggest challenge in learning S for long-time users of single-observation-oriented packages such as SAS is to think in terms of vectors instead of a variable value for a single subject. In SAS you might say

PROC MEANS; VAR age;   *Get mean and other statistics for age;
DATA new; SET old;
IF age < 16 THEN ageg='Young'; ELSE ageg='Old';

The IF statement would be executed separately for each input observation. In contrast, to reference a single value of age in S, say for the 13th subject, you would type age[13]. To create the ageg variable for all subjects you would use the S ifelse function, which operates on vectors [2]:

mean(age)   # Computed immediately, not in a separate step
ageg <- ifelse(age < 16, 'Young', 'Old')

The assignment operator (sometimes typeset as ←) is typed as <-. To show how function calls can be intermixed with other operations, look how easy it is to compute the number of subjects whose age is below the mean age:

sum(age < mean(age))
# could have used table(age < mean(age))
# or to get the proportion, use mean(age < mean(age))
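The vector-oriented style described above can be tried on a small made-up dataset. The following is a minimal sketch in R. The ages are invented for illustration, and one value is deliberately missing to show how NA propagates through ifelse and why summaries then need na.rm=TRUE.

```r
# Hypothetical ages for six subjects; the fourth is missing
age <- c(12, 45, 15, NA, 33, 16)

age[3]                                   # a single subject's value

# ifelse works on the whole vector at once; the missing age stays NA
ageg <- ifelse(age < 16, 'Young', 'Old')
print(ageg)

# With missing values present, summaries need na.rm=TRUE
m       <- mean(age, na.rm=TRUE)         # mean of the five observed ages
n.below <- sum(age < m, na.rm=TRUE)      # number of subjects below the mean
p.below <- mean(age < m, na.rm=TRUE)     # proportion among non-missing subjects
print(c(mean=m, n.below=n.below, p.below=p.below))
```

Without na.rm=TRUE, mean(age) and sum(age < mean(age)) would both return NA for these data, which is the S counterpart of the SAS behavior discussed in the footnote on missing values.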

In S you can create and operate on very complex objects. For example, a flexible type of object called a list can contain any arbitrary collection of other objects. This makes examination of regression model fits quite easy, as a “fit object” can contain a variety of objects of differing shapes, such as the vector of regression coefficients, covariance matrix, scalar R^2 value, number of observations, functions specifying how the predictors were transformed, etc.

S is object oriented. Many of its objects have one or more classes, and there are generic functions that know what to do with objects of certain classes. For example, if you use S’s linear model function lm to create a “fit object” called f, this object will have class ’lm’. Typing the commands print(f), summary(f), or plot(f) will cause the print.lm, summary.lm, or plot.lm functions to be executed. Typing methods(lm) or methods(class=’lm’) will give useful information about methods for creating or operating on lm objects.

Basic sources for learning S are the manuals that come with the software. Another basic source for learning S (and hence, S-Plus) is a book called The New S Language (a.k.a. “the blue book”), by Becker, Chambers and Wilks (1988). One step above it is Chambers and Hastie (1992). Good introductions are Spector (1994) and Krause and Olson (2000). Other excellent books are Venables and Ripley (1999, 2000). Ripley has many useful S functions and other valuable material available from his Web page (http://www.stats.ox.ac.uk/~ripley/) [3]. A variety of manuals come with S-Plus and R, from beginner’s guides to more advanced programmer’s manuals. Also see F.E. Harrell’s book Regression Modeling Strategies (which has long case studies using S with commands and printed and graphical output), and other references listed in the bibliography. Other sources of help are the S-news and R-help mailing lists (see biostat.mc.vanderbilt.edu/rms).

The statlib Web server lib.stat.cmu.edu can provide specific software for some problems, although it is not exclusively related to S and much of its material on S packages is out of date. Also consult Insightful’s Web page http://www.insightful.com. The AT&T / Lucent Technologies Web page (http://www.research.att.com/areas/stat/) points to many valuable technical reports related to the S language. The “Visual Demo”, available from the Help button in Windows S-Plus, is a helpful introduction to the system.

We will concentrate on using S from a Linux or UNIX workstation or Windows S for Microsoft Windows or NT. When we do not distinguish between the two platforms, most of the commands described will work exactly the same in both contexts.

[2] Note that a missing value for age in SAS would result in the person being categorized as ’Young’. In S the result would be a missing value (NA) for such subjects.

[3] Venables and Ripley’s MASS S library has a wide variety of useful functions as well as many datasets useful for learning both biostatistical methods and S.
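To make the earlier description of fit objects and generic functions concrete, here is a minimal sketch in R using the builtin lm function. The data are simulated, and the variable names and coefficients are arbitrary choices for illustration only.

```r
# Simulated data; names and coefficients are arbitrary
set.seed(1)
x <- 1:20
y <- 3 + 2*x + rnorm(20)

f <- lm(y ~ x)        # f is a "fit object"
class(f)              # the class attribute used for method dispatch
is.list(f)            # the fit is stored as a list of components
names(f)              # coefficients, residuals, and so on

f$coefficients        # one component: the vector of regression coefficients
vcov(f)               # covariance matrix of the coefficient estimates
summary(f)$r.squared  # scalar R^2

# print is generic; for f it dispatches to the print method for class 'lm'
print(f)
methods(class='lm')   # lists the methods available for lm objects
```

The components of a fit object differ in shape (a vector, a matrix, a scalar), which is why a list is the natural container for them.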

1.1.1 R

R is an open-source version of the S language (strictly speaking R uses a language that is very compatible with but not identical to S). R runs on all major platforms, including in native mode on some Macintosh operating systems. All of R’s source code is available for inspection and modification. The system and its documentation are available from http://www.r-project.org. The Hmisc and Design libraries are fully available for R. Almost all of the command syntax used in this book can be used identically for R. There are many subtle differences in how graphical parameters are handled. R uses essentially Version 3 of the S language but with different rules for how objects are found from inside functions when they are not passed as arguments.

R has no graphical user interface on Linux and UNIX and only a rudimentary one on Windows. It lacks many of the Microsoft Office linkages and data import/export capabilities that S-Plus has. It has most of the functions S-Plus has, however. R runs slightly faster than S-Plus for certain applications (especially highly iterative ones) and provides easy-to-use functions for downloading and updating add-on libraries (which R calls “packages”). As R is free, it can readily be used in conjunction with web servers. For a software developer, R’s online help files are somewhat better organized than those in S-Plus.
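The difference in how objects are found from inside functions can be illustrated with a function that creates another function. This is a minimal sketch for R only, and make.adder and add2 are hypothetical names. Under R's scoping rules the inner function finds n in the environment where it was defined. As far as we know, the same code would fail in S-Plus with an object-not-found error unless n were passed explicitly, so treat the S-Plus remark as an assumption to verify on your own system.

```r
make.adder <- function(n) {
  # The returned function refers to n, which is not one of its own arguments
  function(x) x + n
}

add2 <- make.adder(2)
result <- add2(5)   # in R, n is found where the inner function was created
print(result)
```

Code that must run identically in both systems can avoid relying on this behavior by passing every needed object as an argument.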

1.2 Starting S

1.2.1 UNIX/Linux

For now, we will discuss the use of S interactively. Before you start S, you should have created a directory where you will keep the data and code related to the particular project. For instance, in UNIX from an upper level directory, type

mkdir sproject
cd sproject

Next type

mkdir .Data

At this point you may want to set things up so that S-Plus does not write to an ever-growing audit file. The .Audit file accumulates a record of all S-Plus activity across sessions. As this file can become quite large, you can turn it off by creating a new empty .Audit using touch .Data/.Audit, and making the file non-writable using chmod -w .Data/.Audit. Now you’re ready to invoke S-Plus.

Splus
S-PLUS : Copyright (c) 1988, 1995 MathSoft, Inc.
S : Copyright AT&T.
Version 3.3 Release 1 for Sun SPARC, SunOS 4.1.x : 1995
Working data will be in .Data
>

If you had not created a .Data (in what follows assume the name is _Data for Windows) directory under your project area, S-Plus would have done it for you, but it would have placed it under your home directory instead of the project-specific directory. Creating .Data yourself results in more efficient management of your data, since (for now) everything will be stored permanently under .Data. In Linux/UNIX, R is invoked by issuing the command R at the shell prompt. R data management is discussed in Section 4.2.

While in S, you have access to all the operating system commands. The command to escape to the shell is !. So, if you want a list of the files in your .Data directory, including hidden files, creation date and group ownership, you could type:

> !ls -lag .Data
total 90
drwxr-xr-x 2 cfa staff   1024 Jun 18 10:28 .
drwxr-xr-x 7 cfa staff   1536 Aug 11  1992 ..
-rw-r--r-- 1 cfa staff  85135 Jun 18 10:28 .Audit
-rw-r--r-- 1 cfa staff    132 Feb 14  1992 .First
-rw-r--r-- 1 cfa staff     16 Jun 18 10:10 .Last.value
-rw-r--r-- 1 cfa staff     64 May  5  1992 .Random.seed
-rw-r--r-- 1 cfa staff    229 May  5  1992 freqs
-rw-r--r-- 1 cfa staff     24 May  5  1992 i
-rw-r--r-- 1 cfa staff 520431 Nov 12  1992 impute.dframe

In R, use the system function to issue operating system commands. In either R or S-Plus you can use the Hmisc sys command. Notice the file called .First. Its purpose is similar to that of an autoexec.sas, that is, it executes commands that you want done every time you start S. More on it later. (You could also have a .Last, for things you want S to do when you leave the system.) Another way to execute operating system commands is to type unix("command"). The unix command is used more frequently in a programming environment.
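As an illustration of the startup mechanism just described, a .First function might look like the sketch below. The option it sets and the message it prints are assumptions chosen for the example, not the contents of the .First file in the listing. In R such a function is normally defined in a startup file such as .Rprofile, which is an R detail not covered in the text above.

```r
# Hypothetical startup function: the option and the message are illustrative only.
.First <- function() {
  options(digits = 4)                # show fewer decimal places by default
  cat("Project session started\n")
}

# S calls .First automatically at startup; calling it by hand shows its effect.
.First()
print(pi)                            # printed with 4 significant digits
```

A .Last function is written the same way and would hold any cleanup commands to run when leaving the system.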

1.2.2 Windows

Windows users first need to decide whether they want to put all objects created by S-Plus in one central .Data [4] directory or in a project-specific area. The former is OK when you are first learning the system. Later it’s usually best to put the data into a project-specific directory. That way the directory stays relatively small, it is easier to decide which objects can be deleted when you do spring cleaning, and you can more quickly back up directories for active projects.

[4] In S-Plus 2000 or earlier on Windows, .Data is _Data.

Users can manage multiple project .Data directories using the Object Explorer in S-Plus (see Section 4.2.3), but this method alone does not do away with the need for the Start in or current working directory to be set so that the File menu will start searching for files in your project area. Defining the Start in directory also allows S-Plus commands that read and write external files (e.g., source, cat, scan, read.table) to easily access your project directory and not some hidden area. Note that S-Plus 6 has a menu option for easily changing between project areas.

The existence of the current working directory is what distinguishes S from Microsoft Word or Excel, applications that can be easily started from isolated files. These applications do not need to link binary data files with raw data files, program files, GUI preference files, graphics and other types of objects. Therefore you do not need to create customized Windows shortcuts to invoke Word, Excel, etc., although Microsoft Office Binder can be used to link related files when the need arises.

The best way to set up for using Windows S-Plus is to use My Computer or Explorer to create a shortcut to S-Plus from within your project directory (if you don’t have a project directory you can create one using My Computer or Explorer). Right click and select New ... Shortcut. Then Browse to select the file where Splus is stored. This will be under the cmd directory (under something like splus or splusxx) and will have the regular S-Plus icon next to it. After creating the basic default shortcut, right click on its icon and select Properties. In the Command line: box click to the right of Splus.exe and add something like

S_DATA=.\.Data S_CWD=.

In the Start in box type the full path name of your project directory, e.g. c:\projects\myproject. By specifying S_CWD and S_DATA, S-Plus will use a central area such as \splusxx\users\yourname for storing the Prefs directory.
Prefs holds details about the graphical user interface and any changes you may have made to it such as system options. As Prefs is about 100K in size, this will save disk space when you have many projects, and let settings carry over from one project to another. If you want a separate Prefs directory in each project area, substitute S_PROJ=. for S_CWD and S_DATA in the shortcut’s command line.

Creation of the S-Plus shortcut only needs to be done once per project (for S-Plus 6 you may not need to do it at all). Then to enter S-Plus with everything pointed to your project, click or double-click on the new S-Plus icon. Depending on how your default Object Explorer is set up (see Section 4.2.3), once you are inside S-Plus you will sometimes need to tell the Object Explorer where your .Data area is located so that its objects will actually appear in the explorer. Right mouse click while in the Object Explorer left pane, and select Filtering. Then click on your .Data area in the Databases window and click OK.

To quit S, simply type q() from the command line (i.e., after the > prompt), or click on File → Exit under Windows. Do not exit by clicking on the X in the upper right corner as this may not close all of the files properly.

To execute DOS commands while under S-Plus use the dos function or !. Under R use system(), and under the Hmisc library use sys() on any platform. For example, you can list the contents of the current working directory by typing !dir. To execute Windows programs, use the win3 function. The Hmisc library comes with a generic function sys that will issue the command to UNIX or DOS depending on which system is actually running at the time. See Chapter 13 for methods of running S in batch mode.
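The idea of a single function that sends a command to whichever operating system is running can be sketched in R as follows. This is not the source of Hmisc's sys: run_cmd is a hypothetical helper written only for illustration, and it relies on R's own system and shell functions rather than the S-Plus dos and unix functions mentioned above.

```r
# run_cmd is a hypothetical helper, not part of S-Plus, R, or Hmisc.
run_cmd <- function(unix_cmd, windows_cmd = unix_cmd) {
  if (.Platform$OS.type == "windows") {
    shell(windows_cmd, intern = TRUE)   # R on Windows: use the command interpreter
  } else {
    system(unix_cmd, intern = TRUE)     # R on UNIX/Linux: use the shell
  }
}

# List the current working directory on either platform
files <- run_cmd("ls", "dir /b")
print(files)
```

With intern = TRUE the output of the command is returned as a character vector, so it can be processed further instead of only being displayed.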

1.3 Commands vs. GUIs

Windows S-Plus is built around a menu-based point-and-click graphical user interface. This kind of interface is especially useful for analysts who use S less than once per week as there are no commands to remember. However, relying solely on the GUI has disadvantages:

1. You can’t do simple computations such as √5.
2. You may want to do further calculations on quantities computed by using a menu or dialog box, but the dialogs are designed to produce only a single result. If for example you want to compute 2-sided P-values for a series of z-statistics, the distributions dialog box may only provide 1-tailed probabilities.
3. There are many commands and options that have not been implemented in the GUI.
4. If you produce a complete analysis and then new data are obtained, you will have to re-select all the menu choices to re-run the analysis [5].

It is difficult to decide how to learn S-Plus because of the availability of both the graphical and the command interface. Because of the richness of the commands, the fact that GUIs are not available for add-on libraries, and the ability to save and re-use commands when data corrections or additions are made, we will emphasize the command orientation.

To introduce yourself to the GUI approach, invoke the Visual Demo from the Help menu tab while S-Plus is running, or from the S-Plus program directory. At first, go through the Data Import, Object Explorer, Creating a Graph, and Correlations demonstrations. Also read Chapter 2 of the online S-Plus User’s Guide and go through the Tutorial in Chapter 3. To access built-in example datasets for the tutorial, press File ... Open and look for the Examples object explorer file (.sbf file) in for example the \splusxx\cmd directory.
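The first two GUI limitations listed above are easy to get around at the command line. The sketch below uses made-up z-statistics; only sqrt, abs, pnorm, and round are assumed, all of which are standard S functions.

```r
sqrt(5)                        # a simple computation typed directly at the prompt

# Hypothetical z-statistics, for illustration only
z <- c(-2.5, 0.3, 1.96, 3.1)
p.one <- 1 - pnorm(abs(z))     # upper-tail (1-sided) probabilities
p.two <- 2 * p.one             # 2-sided P-values
round(p.two, 4)
```

Because the commands operate on the whole vector z at once, the same two lines work unchanged for any number of z-statistics.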

1.4 Basic S Commands

In its simplest form, the S command window serves as a fancy hand calculator. In the examples below S expressions are entered after the command prompt (>). For Windows S-Plus you must first open the Commands window by clicking on its icon.

Results are displayed following the command line. Results are prefaced with a number in brackets that indicates the element number of the ?rst numeric result in each line. As the following commands produce single numbers, these element numbers are not useful. Later we will see that when a long series of results spanning several lines is produced, these counters are useful placeholders. Also note that comments (prefaced with #) appear below.
> 1+1
[1] 2
> 1+2*3+10   # note multiplication done before addition
[1] 17
> sqrt(16)
[1] 4
> 1+2^3      # note exponentiation (2 to the 3rd power) done first
[1] 9
> 1+2^3*7    # exponentiation done first, addition last
[1] 57
> 2*(3+4)
[1] 14
> 2*(3+4)^2
[1] 98
> x <- 4     # store 4 in variable x
> sqrt(x)-3/2
[1] 0.5

[5] It is possible to save the commands produced by the dialogs and re-run these, but not all commands will run properly in non-interactive mode, and the automatically generated commands are verbose.

Even though S is useful for temporary calculations such as those above, it is more useful for operating on variables, datasets, and other objects using higher-level functions for plotting, regression analysis, etc. The following series of S commands demonstrates a complete session in which data are defined, a new variable is derived, two variables are displayed in a scatterplot, two variables are summarized using the three quartiles, and a correlation coefficient is used to quantify the strength of relationship between two variables.
> # Define a small dataset using commands rather than the usual
> # method of reading an external file
> Age <- c( 6, 5, 4, 8, 10, 5)
> height <- c(42, 39, 36, 47, 51, 37)
> Height <- height*2.54   # convert from in. to cm.
> options(digits=4)       # don't show so many decimal places
> Height                  # prints Height values
[1] 106.68  99.06  91.44 119.38 129.54  93.98
> plot(Age, Height)
> quantile(Age, c(.25,.5,.75))
25% 50% 75%
  5 5.5 7.5
> quantile(Height, c(.25, .5, .75))   # also try summary(Height)
  25%   50%   75%
95.25 102.9 116.2
> cor(Age, Height)
[1] 0.9884
> cor.test(Age,Height)

        Pearson's product-moment correlation

data:  Age and Height
t = 13.03, df = 4, p-value = 0.0002
alternative hypothesis: true coef is not equal to 0
sample estimates:
   cor
0.9884

1.5 Methods for Entering and Saving S Commands

One can choose from many approaches for developing S code, entering code interactively, and saving code that runs successfully. A few of these are as follows.

1. You can enter commands to S one at a time, directly at the command prompt. Command recall and editing (using ↑ and ↓ keys and within-line editing through the use of the Home and End keys on Windows, for example) can be of great help in correcting statements. Pressing Enter while the cursor is anywhere inside the command will cause that line to be executed.
2. Commands can be written in an editor window (Notepad, Emacs, Xemacs, Word, Xedit, PFE, WinEdt, UltraEdit, NoteTab, etc.) and then you can highlight/copy/paste desired commands into the S command window. You can also save the file every time you edit it, and bring it into S using the source command. You can save typing by doing something like:
k <- 'c:/mydir/myprog.s'
source(k)     # input code to S-Plus
# Move to edit window and save
source(k)     # redefine code to S-Plus
# Or use Hmisc's src function:
src(myprog)   # note absence of quotes and of .s
# Move to edit window and save
src()         # redefines myprog.s to S-Plus
              # file name remembered by src

See Section 1.5.1 for details on file name specification.
3. You can run the Emacs or Xemacs ESS package with its own interactive S window (especially in Linux/UNIX) to edit the code in an Emacs window and easily execute parts of the code.
4. After entering commands interactively, selected commands (and possibly their output) can be highlighted in the S command window and pasted into an editor window.
5. After entering commands interactively, the S History log can be copied to a file.
6. If your code is contained in an S function, you can have S edit the function.
myfunction <- edit(myfunction)

You may want to override the default editor, using options(editor = 'editorname') [6]. Under Windows you can also specify the editor using Options ... General Settings ... Computations.

[6] For Windows Emacs you would use for example options(editor='gnuclientw'). This will cause the Emacs server (assuming Emacs had already been invoked before running the edit command) to open a new buffer containing the character representation of the object being edited, but to not return control to S until the buffer has been closed using for example Ctrl x #.

You can also use the edit function to edit objects. This is especially handy for character strings. In the following example, the levels of a categorical variable are changed interactively:
levels(disease) <- edit(levels(disease))

A major problem with the use of edit is that if a function contains syntax errors you will lose any changes made.
7. The fix function is an easier to use version of edit for editing functions and other objects:
fix(myfunction)   # assigns result to myfunction
fix()             # edit myfunction again - also allows editing
                  # of file used in previous invocation of fix
                  # when file contained syntax errors

When first learning S, method 1 is very expeditious. After learning S, method 2 has some advantages. One of these is that multiple-line commands that are not part of functions can easily be re-executed as needed. Windows S-Plus has a builtin script editor which includes a facility for syntax checking code before it is submitted for execution. It also provides for easy submission of selected statements for execution after they are highlighted in the script editor. One of the advantages of saving all the S code in a file is that the program can be run again in batch mode if the data or some of the initial commands change.

For managing analysis projects we have found it advantageous to have a “History” file in each project directory, where key results and decisions are noted chronologically. The History file can be constructed by copying and pasting from a batch output listing file or from the command window if using S-Plus interactively. Other options for saving pieces of output include the sink function described in Section 3.5.4, and running the program in batch mode as described in Section 13.1.

As alluded to above, Windows S-Plus has a new option for entering and editing code and saving results. You can open an existing “script” file (suffix .scr) by clicking on File : Open... or start a new one by clicking on File : New. You can submit code for execution using the F10 key. If you highlight code, F10 will cause only the highlighted code to be executed. Otherwise, the entire program will be executed. You can also highlight a function name (if it is a built-in function), right click, and select Help to see that function’s documentation. By default, results will be displayed in a lower part of the window showing your code. You may want to drag the horizontal bar separating the program from its output to allow more space for the output window. You can control where results are sent by clicking on Options then Text output routing.

One place to store output is a Report window, which can be saved to a file in rich text format (rtf). Unlike the lower half of the script program window, the report window has a scroll bar that makes it easy to show analyses done much earlier. After clicking on Options : Text output routing : Report click on Window : Tile Vertical to see the report window alongside the program window. Another advantage of the report window is that you can copy from the graph sheet into the report.

If you want to store a program you’ve edited in the script program window click on File : Save or File : Save As. If you do not use a suffix in the file name box, the suffix will be .scr. If you name a suffix such as .s, that suffix will be used instead. If you like .s to denote S programs as many users do, you will have to click on File : Open then select All Files to view non-.scr files.


In S-Plus for Windows, the Script editor does bracket, brace, and parenthesis matching and context-sensitive indenting. By default, it will also type the matching right brace when you type a left brace. Those who want commands executed immediately (without hitting F10) should open a command window. The output from commands can be interspersed with the commands or it can be directed to a report window.
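Saving pieces of output with the sink function, one of the options mentioned earlier in this section, can be sketched as follows. The file name and the data are hypothetical; a temporary file is used here so that the example leaves nothing behind in the project directory, and tempfile with a fileext argument is an R convenience.

```r
out.file <- tempfile(fileext = ".txt")   # hypothetical output file; any path would do
x <- c(42, 39, 36, 47, 51, 37)

sink(out.file)                           # start diverting printed output to the file
print(summary(x))
sink()                                   # restore output to the command window

cat(readLines(out.file), sep = "\n")     # show what was saved
```

Calling sink() with no arguments is what ends the diversion, so it should always be paired with the call that started it.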

1.5.1 Specifying System File Names in S

In UNIX, directory levels are separated using / both at the system prompt as well as inside S. In Windows, file names use \ outside of S (e.g., when defining shortcuts or in pop-up windows from the S-Plus File ... menu). Inside Windows S you must use \\ inside quoted file names. You can also use / (single slash), as S is kind enough to translate / to \\. \\ is used instead of \ because inside a quoted character string \ is considered an “escape” character that modifies the meaning of another character. For example, the character string '\n' is a newline character.
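The escape-character rule can be checked directly at the prompt. The sketch below reuses the example path c:/mydir/myprog.s from Section 1.5; the object names p1 and p2 are arbitrary.

```r
p1 <- "c:\\mydir\\myprog.s"    # doubled backslashes inside a quoted string
p2 <- "c:/mydir/myprog.s"      # forward slashes, also accepted on Windows

nchar("\\")                    # 1: the doubled backslash is a single character
nchar("\n")                    # 1: backslash-n is the single newline character
cat(p1, "\n")                  # prints c:\mydir\myprog.s

# Converting one form to the other shows that both name the same file
identical(gsub("\\", "/", p1, fixed = TRUE), p2)   # TRUE
```

Using forward slashes throughout avoids the doubling and lets the same code run under UNIX/Linux and Windows.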

1.6 Differences Between S and SAS

Four of the most important distinctions between S and SAS are (1) the S language was designed to be extendable; (2) it is very easy for users to write their own S functions; (3) SAS graphics require a large amount of programming, are non-interactive, are inflexible, and have poor appearance; and (4) SAS is much more efficient than S for analyzing very large databases.

On (1), S makes it very easy for users to add to the basic S language. For example, they can add new operators and new data attributes such as comment attributes for variables or data frames and flags to mark that some values are imputed.

Regarding (2), when SAS first began to be widely used around 1969, it was very easy for users to write their own procedures in Fortran. They could easily define the notation to be used for their new PROC statement, and read SAS datasets using Fortran. Many users wrote SAS procedures, including Harrell’s PROCs PHGLM and LOGIST, which gave SAS the capability to fit logistic and Cox regression models in 1978 and 1979, respectively. In the late 1980’s, SAS converted to a new mode for writing procedures, first in PL/I then in C. The interface became much more difficult to program, and in fact SAS started selling the interface as a separate product (the SAS Toolkit). So not only did all old SAS procedures written by users all over the U.S. become obsolete, but users had great difficulty in writing add-on procedures. On the other hand, the most basic S language texts tell you how to write your own functions, in the same language that S and S developers use. Within your own functions you can also call Fortran or C subroutines extremely easily. As a result, modern statistical methods are available in S long before they become available in SAS if at all.

In terms of ease of learning, anecdotal reports indicate that S is easier to learn than SAS for users who don’t already know SAS. For previous SAS users, the vector and interactive programming orientation of S may take a bit of getting used to. The following table compares SAS with S in several areas.

Table 1.1: Comparisons of SAS and S

Numeric value storage
  SAS: Floating point, 3-8 bytes
  S:   Integer, float single, float double (4, 4, 8 bytes)

Character value storage
  SAS: 1-200 bytes, fixed length (although dataset may be compressed)
  S:   no limit, variable length

Variable names
  SAS: Up to 8 letters (31 for Version 8), case-insensitive (case-sensitive for V8), special character possible: _
  S:   Any length, case-sensitive, special character possible: .

Variable labels
  SAS: Up to 40 letters (256 for V8)
  S:   Any length, user-defined attribute

Value labels
  SAS: Created by PROC FORMAT, stored separate from data
  S:   Intrinsic attribute stored with data in factor variables

Standard missing values
  SAS: ., check using x=.
  S:   NA, check using is.na(x)

Special missing values
  SAS: Values .A,...,.Z, part of standard language
  S:   User-added attributes created automatically by sas.get function

Missing values in logical expressions
  SAS: Treated as the smallest number, logical expression will never result in missing
  S:   Uses correct rules, e.g., T | NA is T, F | NA is NA, NA < 50 is NA

User-defined attributes
  SAS: Not possible
  S:   Added at will. Examples: comment(x) <- 'Variable was corrected 4/1/97', is.imputed(x), partial dates, name of image file containing page of data form where variable was entered

Processing of Observations
  SAS: Record by record
  S:   As vectors or matrices

Dataset format
  SAS: dataset = rectangular table
  S:   data frame = list of vectors and matrices; can attach attributes to data frame

By-processing
  SAS: Run PROC SORT then use BY statement on PROC to group analysis
  S:   Execute functions in a loop, for different subsets (subscripts) of observations, or use tapply or related functions

Post-processing of analysis output
  SAS: Some printed output not available in procedure output datasets (Output Delivery System does help). Hard for user to derive secondary estimates/simulate/bootstrap.
  S:   All calculated values are stored in objects created by functions. Easy to compute other estimates or feed output to bootstrap procedure.

Handling huge datasets (e.g., 100,000 observations on 50 variables)
  SAS: Limited only by disk space
  S:   Limited by memory. Will be very slow if data must be stored in virtual memory that is swapped to disk.

Speed
  SAS: Linear in dataset size
  S:   Faster for small-moderate datasets; slower for large ones if use virtual RAM

Merging
  SAS: General, efficient
  S:   General, slower

Inputting Raw Data
  SAS: Flexible, reads non-standard data formats
  S:   Flexible for ASCII files

Processing steps
  SAS: Separate DATA and PROC steps executed in batch mode
  S:   Line-by-line interactive, can mix data generation and analysis steps

User-written procedures
  SAS: Computational modules can be written using a separate procedure (IML). Can not mix standard PROCs using this mode. Symbolic macro language can mix PROC,DATA steps. Macro language is harder to write and is not “live”. PROCs are very difficult to write, and users cannot add online help files for them.
  S:   User writes functions using standard S language. No symbolic macros are needed. Commands are “live”, i.e., can sense data values and attributes at time of execution. For example, the describe function has a statement like the following to give output appropriate to the type of input variable: if(is.category(x) | length(unique(x)) < 20) table(x) else quantile(x). Easy to call C or Fortran routines from S functions. User-written online help looks builtin.

Vector and matrix operations
  SAS: Available while running PROC IML
  S:   Intrinsic part of language

System Source Code
  SAS: Not available
  S:   Visible for most functions by typing function name. Can learn from, adopt, modify, correct system functions.

Graphics
  SAS: non-interactive, difficult to program, restrictive, ugly
  S:   interactive and batch, best statistical graphics available

Handling of categorical variables in regression models
  SAS: Some procedures allow CLASS statement and generate dummy variables; many do not.
  S:   Dummies always automatically generated

Nonlinear effects in models
  SAS: One or two procedures will generate quadratic terms; most require user to code nonlinear component variables.
  S:   All models allow general transformations of predictors directly in the model formula

Interaction effects
  SAS: Few PROCs will generate these; users must code products (in DATA step) and test them manually
  S:   Automatic

Tests of nonlinearity and pooled interaction effects
  SAS: Must be done manually
  S:   Automatic when using the Design library

Plot how each predictor is represented in model
  SAS: Must create auxiliary datasets and program
  S:   Single statement using Design

Robust covariance estimation for fitted models
  SAS: Macros for “sandwich” estimator available for certain models
  S:   “Sandwich” or bootstrap, with cluster sampling adjustment, available using a single statement with Design

Model validation
  SAS: Not available
  S:   Single statement using Design

Computing Predicted Values
  SAS: Must create dataset containing predictor settings, add to original dataset, and re-run model fit
  S:   If saved result of fitting function (“fit object”) can obtain predictions for any desired predictor settings using predict(fit,...) or using the Design library’s Function function

Graphical summary of model
  SAS: Not available
  S:   Effect plots and nomograms with Design

Missing value imputation
  SAS: PROC MI for linear imputations models with normal distributions
  S:   General method using Hmisc’s aregImpute and impute functions

Bayesian inference
  SAS: Not available
  S:   BUGS package interfaces with S

Mixed models
  SAS: PROC MIXED for linear models has nice features for Gaussian, binary, Poisson responses
  S:   A few models are available, including nonlinear mixed models; computational properties not as good as PROC MIXED.

Penalized maximum likelihood estimation
  SAS: Ridge regression for linear Gaussian models
  S:   More general penalized MLE for linear Gaussian model and binary and ordinal logistic models, with differential penalization by type of term in model

Penalized estimation with variable selection
  SAS: Not available
  S:   lasso function in Statlib

Tree (CART) models
  SAS: Not available except in Enterprise Miner
  S:   rpart function and graphical representation

Generalized additive models
  SAS: Recently available
  S:   Builtin

Nonparametric smoothing
  SAS: Extremely slow PROC IML macro; new features for V8
  S:   Builtin, variety of smoothers
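A few of the S entries in Table 1.1 can be tried directly. The sketch below covers missing values in logical expressions, a user-defined comment attribute, and by-processing with tapply. The data values, the comment text, and the grouping variable are made up for illustration.

```r
# Missing values in logical expressions
TRUE  | NA      # TRUE: the result is TRUE whatever the missing value is
FALSE | NA      # NA: the result depends on the missing value
NA < 50         # NA

# Standard missing values are detected with is.na
age <- c(6, NA, 4, 8)
is.na(age)

# A user-defined attribute added at will
comment(age) <- "Second value not yet entered"   # hypothetical comment text
comment(age)

# By-processing without sorting: mean height within each (hypothetical) group
height <- c(42, 39, 36, 47, 51, 37)
group  <- c("a", "b", "a", "b", "a", "b")
tapply(height, group, mean)
```

The comment stays attached to age as an attribute, so it travels with the variable rather than being kept in a separate codebook.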

The following table lists SAS procedures and corresponding S functions. In this table, ols, lrm, psm, bj, and cph are from the Design library, and summary.formula, summarize, rcorr, describe, varclus, and transcan are from the Hmisc library. Other functions are built-in.

Table 1.2: SAS Procedures and Corresponding S Functions

SAS Procedures              S Functions
ANOVA                       aov
REG,GLM                     lm,glm,ols,bj,manova
LOGISTIC                    glm,lrm
LIFEREG                     survreg,psm,bj
LIFETEST                    surv.diff,survfit,cph
PHREG                       coxph,cph
FREQ                        table,crosstabs,summary.formula,
                            mantelhaen.test,fisher.test,chisq.test
TABULATE                    summary.formula
MEANS,SUMMARY,UNIVARIATE    mean,var,quantile,summary,describe
CORR                        corr,rcorr
VARCLUS                     varclus
PRINQUAL                    transcan
BY statement                tapply,by,aggregate,split,summary.formula,
                            summarize,for
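To make the Table 1.1 entries on categorical variables, nonlinear effects, and interactions concrete, here is a small sketch using the built-in lm function from Table 1.2. The data frame d and its values are invented for the example; the Design library functions are not used here.

```r
# Hypothetical data, for illustration only
d <- data.frame(
  age    = c(6, 5, 4, 8, 10, 5, 7, 9),
  sex    = factor(c("f", "m", "f", "m", "f", "m", "f", "m")),
  height = c(107, 99, 91, 119, 130, 94, 112, 125)
)

# age * sex expands to age + sex + age:sex; the dummy variable for sex
# is generated automatically from the factor
fit1 <- lm(height ~ age * sex, data = d)

# A transformation of a predictor written directly in the model formula
fit2 <- lm(height ~ log(age) + sex, data = d)

names(coef(fit1))
```

The coefficient names show the generated dummy variable (sexm) and the product term (age:sexm), neither of which had to be coded by hand.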

1.7 A Comparison of UNIX/Linux and Windows for Running S

The UNIX/Linux operating system is a better environment for software developers because of the wide variety of tools available [7]. UNIX/Linux is also a good choice if you are processing large databases, as it is cost-effective to have a “compute server” on your UNIX/Linux network that can be used by many users for large applications [8]. Having used both UNIX and Windows extensively, we feel that UNIX (and hence Linux) is a more efficient and reliable environment for everyday S users, as UNIX window navigation is more efficient than Windows. Windows users tend to spend too much time navigating menus and Windows operates significantly slower than Linux because of the design and massive size of Windows operating systems. However, the greatest advantage of UNIX is probably that a nice system administrator would have already installed the tools you need, including Emacs, Ghostview, LaTeX, and a variety of print utilities. Many versions of Linux come with all of these tools automatically. But Windows has a few advantages also: (1) ease of installing add-on S-Plus and R libraries; (2) faster online help for S-Plus; (3) outputting graphs in Windows metafile and other formats for easy inclusion and editing using Microsoft PowerPoint or Word [9]; and (4) only the Windows version of S-Plus has menus for doing standard analyses and graphics. S-Plus 6 is available for UNIX, Linux, and Windows. This has resulted in a partial convergence of Linux/UNIX and Windows S-Plus, with a more or less common graphical user interface [10]. See http://biostat.mc.vanderbilt.edu/s/howto/linux.setup.html for more information on setting up a Linux system and installing software of interest to data analysts.

[7] Windows users can not-so-easily install versions (e.g., GNU) of many of the UNIX tools, such as a bash shell command window.
[8] You can also program a UNIX system to compress large databases that haven’t been read in a week. That way your disks will not fill up nearly as quickly.

1.8 System Requirements

For UNIX/Linux the minimum amount of RAM is 64MB. For PCs, 128MB is minimal. If you will be analyzing large databases (roughly speaking, > 40000 observations), you may need at least 256MB of RAM. For analyzing very large databases (say > 100000 observations), more than 256MB of RAM will usually be needed. Windows 2000 and XP use memory much more inefficiently than earlier versions of Windows, so add more RAM accordingly. RAM is cheap, so it’s best to order your PC with 256MB. If you have only occasional need for more than 256MB of RAM, you may want to endure the slowness of virtual memory for those applications. A minimum PC CPU for running Windows S-Plus is a 400 MHz Pentium. R requires less memory to run than S-Plus.
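A rough feel for these figures comes from simple arithmetic. The sketch below computes the raw storage for the dataset size used as an example in Table 1.1 (100,000 observations on 50 variables), under the assumption that every variable is stored as an 8-byte double. It is only a lower bound: computations usually need additional working copies of the data, so actual memory use will be larger.

```r
n.obs  <- 100000                 # observations
n.vars <- 50                     # variables, all assumed to be 8-byte doubles
bytes  <- n.obs * n.vars * 8     # raw storage in bytes
bytes / 2^20                     # the same figure in megabytes (about 38)
```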

1.9 Some Useful System Tools

There are several system tools that can greatly assist the S user. UNIX users usually have an advantage in that their system administrator would have already installed most of the tools, and many Linux packages come with all of the important tools pre-installed. For Windows users, Web addresses for obtaining the software are provided. biostat.mc.vanderbilt.edu/EmacsLaTeXTools has a large amount of information on obtaining and installing Emacs, LaTeX, and related programs.

Emacs editor: Emacs is an incredibly powerful editor for editing text files of various types. Emacs is especially powerful for editing S code, as it has a special mode which highlights different kinds of S statements in different colors or fonts and it does indentation according to the level of nesting. It also makes it easy to check for matching parentheses, brackets, and braces. Emacs for Windows (all 32MB of it when uncompressed!) is available from ftp://ftp.gnu.org/gnu/windows/emacs/latest. Harrell’s version of the Emacs startup file (.emacs) is available from the Utilities area of the UVa Web page. This .emacs file has several useful default settings for how Emacs operates. S-mode for Emacs using the ESS Emacs package may be obtained from http://software.biostat.washington.edu/statsoft/ess. S-mode can also run S-Plus or R itself, allowing for such capabilities as object name completion in the editing window if you enter the first few letters of an object’s name. This mode is known to work well under UNIX/Linux.
[9] Windows S-Plus can output graphics directly into PowerPoint Presentation format as well as Adobe Acrobat .pdf files (see below), and R can make .pdf files. Note however that using Windows metafiles to include graphics into Microsoft Office applications frequently does not preserve all aspects of the graphics. Postscript is still the most reliable graphics format.
[10] This version is based on the version 4 engine of the S language, which unfortunately will require some functions to be modified. All modifications have been made in Harrell’s libraries.

Windows users may find that Xemacs is a bit more user-friendly, and Xemacs has a menu for automatically downloading and installing packages such as ESS. Like Emacs, Xemacs can be automatically installed when you install Linux. Windows users may obtain Xemacs from www.xemacs.org.

Ghostview: This is a previewer for postscript graphics and documents. It is available for Windows from http://www.cs.wisc.edu/~ghost/. Ghostview comes with Ghostscript, which can convert postscript files to .pdf files (but not as efficiently as Adobe Acrobat) among other things.

LaTeX: This system is excellent for composing technical documents and advanced tables. It is the typesetting system used to make this document, and it is used by many book publishers. An excellent commercial version of LaTeX for Windows can be obtained by contacting Personal TeX Inc. at texsales@pctex.com or http://www.pctex.com. If you want to be able to produce electronic documents (e.g., .pdf files) with hyperlinks, the full TeX package from Y&Y Inc. is recommended. See www.YandY.com. Perhaps the best versions of LaTeX for Windows are free versions, FPTeX by Fabrice Popineau and MikTeX, both available at http://www.ctan.org. FPTeX’s DVI previewer allows postscript graphics to be displayed, assuming you have installed Ghostscript. Several tools for creating .pdf files are also included in FPTeX. See http://ctan.tug.org/tex-archive/info/lshort/english/lshort.pdf for a nice free book for learning LaTeX.

Adobe Acrobat Reader: Available from www.adobe.com, this free program nicely displays .pdf files. You can create these graphics files directly in Windows S-Plus using the pdf.graph device function. Occasionally this will get around printer memory problems when printing complex graphs, and a few graphs can only be faithfully printed in Windows this way.

Metafile Companion: This program, for which a free trial version is available from www.companionsoftware.com, allows you to edit Windows metafiles, a graphics format you can produce either with the dev.print function in S 3.2+ or using the File ... Export Graph dialog. Metafile Companion is one of the nicest graphics editors available anywhere. It allows you to edit any detail of the graph.

Mayura Draw: This shareware program is a nice scientific drawing program. It can take as input an Adobe Illustrator file, which can be converted by Ghostscript from a postscript file. Using that combination of programs gives you the ability to nicely edit postscript graphs. See www.mayura.com for information about Mayura Draw.

graphviz: This is an amazing command language from AT&T for drawing complex tree diagrams. Linux, UNIX, and Windows versions are available from http://www.graphviz.org

Xmouse: You can make a Windows 95 mouse work like a mouse in UNIX X-windows by installing Microsoft's PCToys package and running its Xmouse program. That way when you move the mouse from an editor window to the S command window you do not need to click the left mouse button to make the S window have the mouse's focus. This really helps in copying text from the editor to S. Also, if you had to click the left mouse button, the editor window would usually disappear. For Windows 95, obtain Xmouse from the Powertoys package at www.microsoft.com/windows95/downloads/contents/wutoys/w95pwrtoysset. For Windows 98, this functionality is in the tweakUI package that is an optionally installed component of the Win 98 installation disk. With Windows 98 tweakUI you can also specify an option to have the "currently focused on window" automatically move to the top.

UltraEdit: Users who want a powerful programmer's editor that is not as comprehensive (or as large) as Emacs may want to consider buying UltraEdit (www.idmcomp.com).

WinEdt: Next to Emacs this is probably the best editor for Windows/NT users, especially when used in conjunction with LaTeX. Trial and licensed copies may be ordered from www.winedt.com.

NoteTab: This is a nice editor for Windows that has a flexible macro language for making the editor language-sensitive and allowing submission of code to an open window (using ctrl-space (repeat last macro)). A free version is available from www.notetab.com. Dieter Menne (<dieter.menne@menne-biomed.de>) wrote the following macros for using NoteTab with R.

^!FocusDoc
;Save the file if it has been modified
;^!Save
;Select the highlighted block.
^!If ^$GetSelSize$ = 0 END ELSE SelectLines
:SelectLines
:GetSelection
^!Set %AnyText%=^$GetSelection$
;Write the selected text to a temporary file in the Windows temp. dir.
^!Set %fileName%=^$GetTmpPath$std0001.r
^!Set %fileName%=^$StrReplace(\;/;^%fileName%;True;False)$
^!TextToFile ^%fileName% ^%AnyText%
; Copy "source" to the clipboard
^!SetClipboard source("^%fileName%")
; Switch to R
^!FocusApp RGui*
;ESC to clear the Command window, paste the command, hit enter
^!Keyboard ESC
^!Keyboard CTRL+V
^!Keyboard ENTER

Dieter also wrote the following reg-file to start R from Windows Explorer.

REGEDIT4

[HKEY_CLASSES_ROOT\Directory\shell\Run R]

[HKEY_CLASSES_ROOT\Directory\shell\Run R\command]
@="\"C:\\Program Files\\R\\rw1050\\bin\\Rgui.exe\" --internet2"

TeXmacs: This is a WYSIWYG front-end to LaTeX for Linux and UNIX users that gives you a full equation editor. It is available from www.math.u-psud.fr/~anh/TeXmacs/TeXmacs.html.

PFE: A nice, small, free programmer's editor is PFE, which may be downloaded from http://www.lancs.ac.uk/people/cpaap/pfe/. PFE is an excellent replacement for NOTEPAD even if you just use it for viewing files. If PFE is already open and you invoke it on another file, it will add the new file to the list of files it is currently managing. Emacs can do this using its GNUCLIENT feature. To use PFE as your default editor, you can issue the S command options(editor='c:/pfe/pfe32') if pfe32.exe is stored in the c:\pfe directory, or enter the Options ... General Settings ... Computations dialog.

Microsoft Word: Damien Jolley (djolley@ariel.ucs.unimelb.EDU.AU) wrote a Microsoft Word macro that sends highlighted code to S for execution. His macro definition follows.

Sub MAIN
If SelType() <> 2 Then EditSelectAll 'Select all if none current
EditCopy
SendKeys "%w1+{insert}{enter}^{F6}"
AppActivate "S-PLUS for Windows"
End Sub

A Word 97 version of the macro follows.

Public Sub MAIN()
If WordBasic.SelType() <> 2 Then WordBasic.EditSelectAll 'Select all if none current
WordBasic.EditCopy
WordBasic.SendKeys "%w1+{insert}{enter}^{F6}"
WordBasic.AppActivate "S-PLUS for Windows"
End Sub

To quote from Damien: “I have this stored as a macro which I can execute from a user-defined button on the Toolbar. So, when I'm ready to test my bit of code, I just click the button, and Windows switches over to S-Plus, copies the code into the S-Plus command buffer and execution takes place immediately. I use ALT-TAB to return to Word either to fiddle with the code or to save it to a text file”. To enter the macro, record and then edit a macro. You start the recorder and enter a random command, then stop the recorder and give the macro a name. Then edit it to make the real macro. John Miyamoto (jmiyamot@u.washington.edu) has a series of Word 6 macros for interfacing with S-Plus. These macros are available from the Utilities area under Statistical Computing Tools on the UVa Web page.

JED: This is a nice small version of Emacs available from John Davis at http://space.mit.edu/~davis/jed.html.


Chapter 2

Objects, Getting Help, Functions, Attributes, and Libraries

2.1 Objects

In SAS, one has several concepts which refer to different types and characteristics of data, like data files, data views, data catalogs, format catalogs, libraries, etc. You get results from these data by using a PROC step. S has different entities representing data, such as vectors, factors, matrices, data frames, lists, etc. These entities have different characteristics called attributes, such as names, class, dim, dimnames, etc., and we get results by applying functions to them. Any entity in S is designated by the general name of an object. The names of objects in S can be of any length, and can contain digits, mixtures of lower and upper case letters, and periods. Names may not contain underscores and may not start with a digit. In some cases you will want the names to be very descriptive (e.g., age.years) but in other cases it's best to use a short name (e.g., age) and then to assign a longer label as an attribute (see footnote 1). Names in S are case-sensitive, so that vectors age and Age would refer to two different objects. This can be handy for distinguishing between various versions of the same basic information. For example, age might refer to the original age variable whereas Age might refer to age values after certain data corrections or missing value imputations.
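As a small illustration of case-sensitive names, the following sketch (written for R) creates two distinct objects, age and Age. The data values and the imputation rule are invented for illustration only and are not taken from the text.

```r
# Hypothetical raw ages; NA marks a missing value
age <- c(34, 51, NA, 47)

# Age holds a "corrected" copy; here the missing value is replaced by the
# mean of the observed ages (an arbitrary choice for this sketch)
Age <- age
Age[is.na(Age)] <- mean(age, na.rm = TRUE)

age   # original values, still containing NA
Age   # a different object: 34 51 44 47
```

Because the two names differ only in case, the original data remain available for comparison after the correction.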

2.2 Getting Help

Suppose we want to get help on a function, and see if it has any options that we may want to use. There are several ways to do this. A very simple one is to type ?mean (or whatever the name of the function is). Equivalently, we could type help(mean). If the function name contains special characters, the name should be enclosed in quotation marks; thus help("%*%") gives help for the matrix-product function.

1 This can be done using the label function, which is in the Hmisc library described below, e.g., label(age) <- 'Age in years'. When using the sas.get function to convert SAS datasets to S data frames, SAS labels are automatically carried to S in this fashion.
> ?mean
Mean Value (Arithmetic Average)

DESCRIPTION:
       Returns a number which is the mean of the data. A fraction to be
       trimmed from each end of the ordered data can be specified.

USAGE:
       mean(x, trim=0, na.rm=F)

REQUIRED ARGUMENTS:
x:     numeric object. Missing values (NAs) are allowed.

OPTIONAL ARGUMENTS:
trim:  fraction (between 0 and .5, inclusive) of values to be trimmed from
       each end of the ordered data. If trim=.5, the result is the median.
na.rm: logical flag: should missing values be removed before computation?

VALUE:
       (trimmed) mean of x.

DETAILS:
       If x contains any NAs, the result will be NA unless na.rm=TRUE. If
       trim is nonzero, the computation is performed to single precision
       accuracy.

When you use either of these two forms of help, the system looks for a file in some directory and then displays the help file. This means that a window will pop up with options to print the help file, search for character strings, etc. (see footnote 2). If you are running on a UNIX workstation, you may want to initiate the interactive help system. Type help.start() and a window listing all functions and categories of functions will appear. Just click on the one you want help about, and a new window will pop up with help specifically on that function. You can then look at it, close it, keep it around, or send it to the printer. With this method you can also type something like regres* in the topic field of the help window to get a list of all functions which start with 'regres'. The disadvantage is that this is slower. To quit the window type help.off(). A third way, if you don't want full help but just want to be reminded of what the arguments to the function are, is to use the args function built in to S.
2 Under UNIX X-Windows it is beneficial to use, e.g., options(pager='xless') to get a full-screen pop-up window instead of the system default, in which the less command is run inside of the S command window.

> args(mean)
function(x, trim = 0, na.rm = F)

The function has three arguments: x for the vector of which we want the mean, trim= if we want trimmed means, and na.rm= to remove missing values. The defaults are trim=0 and na.rm=F. Here T is the logical true value, so we interpret na.rm=T as saying that the na.rm argument is turned “on.” If you name the arguments, they can be given in any position, for example mean(x,na.rm=F,trim=.5). See Section 2.3 for more about functions and arguments. You can also use names(functionname) to list the arguments, or functionname$argumentname to list the default argument value. A quick way to get an alphabetic listing of a function's arguments is to type sort(names(function.name)). Note that there is an extra element with a blank name that should be ignored.

> sort(names(mean))
[1] ""      "na.rm" "trim"  "x"
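In R, function arguments are not reached with names() and $ in the way shown above for S-Plus. A rough R counterpart, assuming the reader is working in R, uses formals(). The sketch inspects mean.default, which is the method that carries the trim and na.rm arguments in R.

```r
# Argument names of R's default mean method
arg.names <- names(formals(mean.default))
sort(arg.names)

# Default value of a single argument
formals(mean.default)$trim    # 0
formals(mean.default)$na.rm   # FALSE
```

Unlike the S-Plus listing, the R result contains no blank-named element; it does include "...", which stands for further arguments passed along to other methods.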

The object orientation of S can make it difficult to know the full name of the function you are really using. For example, if you need help in plotting a logistic regression fit using the Design library, you may not know that the pertinent plot function is plot.Design. You can get a list of all of the plot methods by typing methods(plot). You can get a list of all of the methods for handling the fit object by typing methods(class=class(f)) if the fit object is f. If you are having trouble understanding what the function does or how it is doing things, you can always look at the function itself.

> mean
function(x, trim = 0, na.rm = F)
{
    if(na.rm)
        x <- x[!is.na(x)]
    else if(any(is.na(x)))
        return(NA)
    if(mode(x) == "complex") {
        if(trim > 0)
            stop("trimming not allowed for complex data")
        return(sum(x)/length(x))
    }
    x <- as.double(x)
    if(trim > 0) {
        if(trim >= 0.5)
            return(median(x))
        n <- length(x)
        i1 <- floor(trim * n) + 1
        i2 <- n - i1 + 1
        x <- sort(x, unique(c(i1, i2)))[i1:i2]
    }
    sum(x)/length(x)
}

Yet another possibility is to look at the help files without even starting S-Plus. You may find yourself in this situation if you are running a job in batch mode and want to find out why it didn't work.

In UNIX it's easy to define shell programs to facilitate this, as well as to list help files associated with keywords. Under Windows, you can use Explorer or My Computer to click on a .hlp file in the main S-Plus area or in an add-on library area (see below). Last, but not least, consult the back of the blue book or the S-Plus User's manual. The help here is exactly the same as the on-line help but not all functions are listed. In S-Plus Version 4.x and later the manuals are online with some search capability.

The following is a list of major help topics for S-Plus as it is distributed from MathSoft. This list will help in understanding the components of the system as well as how you can find a function when you don't know its name. In Windows you could click on any of these topics to see all functions related to that topic. In UNIX you use the help.start() command to put up the list of topics.

Add to Existing Plot
All Datasets
ANOVA Models
Categorical Data
Character Data Operations
Clustering
Complex Numbers
Computations Related to Plotting
Customizable Dialog functions
Customizable Menu functions
Data Attributes
Data Directories
Data Manipulation
Data Types
Dates Objects
Demo Library
Demonstration of S-PLUS
Deprecated Functions
Documentation
Dynamic Graphics
Error Handling
Graphical Devices
High-Level Plots
Input/Output-Files
Interacting with Plots
Interfaces to Other Languages
Library of Chapter 11 Functions from The New S Language
Library of Chronological Functions
Library of Drawing Functions from Programmer's Manual
Library of Examples from Programmer's Manual
Library of Examples from The New S Language
Linear Algebra
Lists
Loess Objects
Logical Operators
Looping and Iteration
Mathematical Operations
Matrices and Arrays
Methods and Generic Functions
Miscellaneous
Multivariate Techniques
Non-linear Regression
Nonparametric Statistics
Optimization
Ordinary Differential Equations
Printing
Probability Distributions and Random Numbers
Programming
Quality Control
Regression
Regression and Classification Trees
RELEASE NOTES
Robust/Resistant Techniques
Smoothing Operations
S-PLUS Session Environment
Statistical Inference
Statistical Models
Survival Analysis
Time Series
Trellis Displays Library
Utilities

2.3 Functions

You are starting to see that unless you are using the pull-down menu system in S-Plus, almost everything is done by calling functions (see footnote 3). A function is an object in S and in many ways it can be operated on as data. Most functions have arguments that pass values to the function for it to work on or to specify detailed options on how it should do its work. It is common, for example, to pass a vector of data (representing a single variable) to a function along with scalars or other shorter vectors specifying options such as confidence levels, quantiles, plotting and printing options, etc. Arguments are given to the function either by name or by their sequential position in the series of arguments. It is very common to specify a “major” argument without its name, in position one, then to specify “minor” arguments by name. This is because there are so many “minor” arguments and it is hard (and risky) to try to remember their order. For example, we can compute the mean age using the command mean(age, na.rm=T), which means to compute the mean of the age vector ignoring missing values. We could use the equivalent statement mean(age, , T), i.e., we can assign the logical “true” value (T) to the third argument to mean, which we can see from mean's help file is the na.rm argument. The extra comma is a placeholder indicating that we are not specifying the second argument, which is trim; trim will receive its default value of zero. As mentioned above, this is a dangerous method, so we prefer mean(age, na.rm=T).

3 Menu choices are actually executed by secretly calling functions.

When we examined the help file for the mean function we saw na.rm=F in the list of arguments. This means that the default value for na.rm is F, so that na.rm will be assumed to be F if you do not specify this argument. Default values can also be vectors, lists, matrices, and other objects as the need arises. Often you will see that the default for an argument is a vector of values when the argument really needs to be a scalar. In these cases, the vector of values frequently specifies the list of possible values of the argument, with the default value listed first. For example, look at the argument list for the residuals.lm function:
> args(residuals.lm)
function(object, type = c("working", "pearson", "deviance"))

Here the type argument can take one of three values. If you do not specify type, 'working' residuals will be computed.
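These conventions can be tried with a minimal sketch written for R. The function center, its arguments, and the data are hypothetical; match.arg is the R function that selects the first element of the default vector when type is not given.

```r
# Hypothetical function: the default for 'type' lists the allowed choices,
# with the default choice first
center <- function(x, type = c("mean", "median"), na.rm = FALSE) {
  type <- match.arg(type)          # "mean" unless the caller says otherwise
  if (type == "mean") mean(x, na.rm = na.rm) else median(x, na.rm = na.rm)
}

age <- c(30, 40, NA, 80)           # invented data

center(age, na.rm = TRUE)                   # named minor argument: 50
center(age, "median", TRUE)                 # all arguments by position: 40
center(age, na.rm = TRUE, type = "median")  # named arguments in any order: 40
```

The second call works but relies on remembering the argument order, which is the risk the text warns about; the named forms are easier to read and safer.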

2.4 Vectors

A statement to create a vector interactively could be something like this:

> x <- c(3.1,2.6,3.4,5.9,7.6)

In creating x we used two S operators, the assignment statement "<-", which is read "x gets ...", and the concatenation function c(). A synonym for <- is the underscore sign (_). Of course the assignment could have been written in a reversed way,

> c(3.1,2.6,3.4,5.9,7.6) -> x

or

> c(3.1,2.6,3.4,5.9,7.6) _ x

Two or more assignments could be made on the same line if separated by a semicolon. A statement can also be continued over two or more lines. Just hit return at the end of your line and you will get a continuation prompt "+" at the beginning of the next line, then continue typing. You can concatenate two or more existing vectors and include other data as an argument to the c() function.

> y <- 10.6:2.3;z <- c(x,c(1,2,3),y
+ y^2)
Syntax error: name ("y") used illegally at this point:
z <- c(x,c(1,2,3),y
y

Here, we forgot the comma after y on the first line.

> y <- 10.6:2.3;z <- c(x,c(1,2,3),y,
+ y^2,y+1)

If we want to see what's stored in the vectors y and z, just type their names:

> y
[1] 10.6  9.6  8.6  7.6  6.6  5.6  4.6  3.6  2.6
> z
 [1]   3.10   2.60   3.40   5.90   7.60   1.00   2.00   3.00  10.60   9.60
[11]   8.60   7.60   6.60   5.60   4.60   3.60   2.60 112.36  92.16  73.96
[21]  57.76  43.56  31.36  21.16  12.96   6.76  11.60  10.60   9.60   8.60
[31]   7.60   6.60   5.60   4.60   3.60

There are several things to notice here. First, the operator a:b produces a sequence that starts at a and repeatedly adds 1 (or subtracts 1 if a is greater than b), stopping before the next value would pass b. (You may want to experiment to see what happens when a or b are negative.) Second, we have y^2, which just squares each element of y. All functions which return a single numerical result from a single numerical argument, such as exp, sqrt, sin, cos, tan, atan, log, etc., act on each element of the vector. Finally, adding a number to a vector just adds the number to each component of the vector. What happens if we add two vectors of different length? Let's see.

> x <- 1:9
> y <- 1:10
> x+y
 [1]  2  4  6  8 10 12 14 16 18 11
Warning messages:
  Length of longer object is not a multiple of the length of the shorter
  object in: x + y

When adding (or subtracting) two or more vectors of different length, the shorter vectors are recycled until they reach the length of the longest vector and then the operation is performed. A warning message is issued when, as here, the longer length is not a multiple of the shorter one. Also notice that we did not assign the result of the sum, but printed it directly instead. To list vectors left over from a previous session, use objects(). To delete them, use rm(x,y,z) where x, y and z are the vectors to be deleted. This function works in exactly the same way with objects other than vectors. You can also use the more versatile remove function to delete objects, e.g., remove(c('x','y','z')). Next, let us do some statistics on these vectors. How many observations do we have? What is the mean? And the standard deviation?

> length(z)
[1] 35
> mean(z)
[1] 17.384
> sqrt(var(z))
[1] 26.59949
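The sequence operator, elementwise arithmetic, and recycling described in this section can be checked with a short sketch written for R. The vectors are small invented examples, chosen so that the results are easy to verify by hand.

```r
y <- 10.6:2.3          # 10.6 9.6 ... 2.6: steps of -1, stopping before passing 2.3
length(y)              # 9

sqrt(c(1, 4, 9))       # functions of one number act on each element: 1 2 3
c(1, 2, 3) + 10        # a single number is added to every element: 11 12 13

# Recycling: the shorter vector is reused. No warning is given here because
# 6 is a multiple of 2.
1:6 + c(10, 20)        # 11 22 13 24 15 26
```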

2.4.1 Numeric, Character and Logical Vectors

All elements of a vector must be of the same type, that is, integers, real numbers, complex numbers, logical values (T or F), or character strings. Examples are c(3,6,9), c(1+2i,.2,-3-5.6i), c(T,T,F) and c("x","y","z"). To determine what kind of vector we have, we could type mode(x) and this will return a character string telling us the kind of vector. It is also possible to assign a value to the mode of a vector, forcing it to be something else.

> x <- c(3.1,2.6,3.4,5.9,7.6)
> x
[1] 3.1 2.6 3.4 5.9 7.6
> mode(x)
[1] "numeric"
> mode(x) <- "character"
> x
[1] "3.1" "2.6" "3.4" "5.9" "7.6"

There are a number of functions to test for the mode of a vector and to change it. In general, if we try to operate on a vector whose mode is not appropriate for that kind of operation, S will automatically convert it to another kind trying to lose the least possible amount of information in the process. Thus, c(T,F)+c(3,4) yields c(4,4) (Fs are converted to zeros and Ts are converted to ones). The functions to test and change modes are
is.numeric, as.numeric
is.character, as.character
is.logical, as.logical
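The testing and conversion functions listed above can be tried with a brief sketch written for R, using invented values.

```r
x <- c(3.1, 2.6, 3.4)
is.numeric(x)                 # TRUE

s <- as.character(x)          # "3.1" "2.6" "3.4"
is.character(s)               # TRUE
as.numeric(s) + 1             # back to numbers: 4.1 3.6 4.4

c(TRUE, FALSE) + c(3, 4)      # logicals become 1 and 0: 4 4
as.logical(c(0, 1, 2))        # zero is FALSE, nonzero is TRUE
```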

A useful function in the Hmisc library which may save you some typing is Cs(a,b,c,d). It is equivalent to c("a","b","c","d") but it won't work if your character strings have an _ in them (since _ is equivalent to <-).

2.4.2 Missing Values and Logical Comparisons

Missing values in numeric and logical vectors are represented by the symbol NA (not available). In general, any operation (mathematical or logical) performed on a missing value will return a missing value. The logical operators are >, >=, <, <=, ==, !=, &, |,!. Notice that the operator to test equality is == rather than =, which is reserved for named arguments to a function. ! is used for negation and & and | for logical ‘and’ and ‘or’. Consider for instance
> x <- c(3,6,9,10,2.2,NA,NA,6.7);
> y <- c(1,6,9,2,NA,5.1,0,-1)
> x > y
[1]  T  F  F  T NA NA NA  T

The operator == is not appropriate to test for missing values. Instead, use the function is.na.
> is.na(x > y)
[1] F F F F T T T F
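The difference between == and is.na can be seen in a small sketch written for R. The vector is invented for illustration.

```r
x <- c(3, NA, 7)

x == NA                 # NA NA NA: comparing with NA never answers the question
is.na(x)                # FALSE TRUE FALSE

x > 5                   # FALSE NA TRUE
sum(is.na(x))           # number of missing values: 1
mean(x)                 # NA, because one value is missing
mean(x, na.rm = TRUE)   # 5
```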

Suppose that we have two vectors of the same length, and we want to know the joint distribution of their missing values.
> x [1] > y [1] 1 2 1 2 1 NA NA 2 2 2 2 2 2 1 2 1 2 NA 1 1

2 NA

4 NA

One way would be to use the table function

> table(is.na(x),is.na(y))
      FALSE TRUE
FALSE     7    2
 TRUE     3    0

You can also tabulate all patterns of NAs using the built-in function na.pattern (but note that na.pattern was omitted from S-Plus 2000):

> na.pattern(list(x,y))
00 01 10
 7  2  3

Also see the naclus function described under the varclus function in the Hmisc library discussed below.
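Where na.pattern is not available, a similar tabulation can be sketched with base functions only. This is one possible approach written for R, not the implementation of na.pattern, and the two vectors are invented.

```r
# Invented data with missing values in different positions
x <- c(1, NA, 2, NA, 1, 2)
y <- c(2, 1, NA, 2, 2, 1)

# One pattern string per observation: "1" marks a missing value
pattern <- paste(as.integer(is.na(x)), as.integer(is.na(y)), sep = "")
counts  <- table(pattern)
counts
```

Each pattern string has one digit per variable, so "10" counts the observations where x is missing and y is present.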

2.4.3 Subscripts and Index Vectors

It is possible to select subsets of a vector by subscripting or indexing its elements. This is equivalent to using a WHERE statement in SAS, but it is more flexible. The expression to use is x[i], where i could be another vector, or an expression which evaluates to a numeric, logical or character vector. In all cases, we'll think of the elements of x as being subscripted by the indexes 1:length(x) when [ ] is not present.

1. If i is a numeric vector, all its elements must be >=0 or all <=0 (NAs are allowed). Before selecting the subset, S drops all zeros from the index vector. If all elements of i are positive, then x[i] selects only those elements of x whose subscripts match the elements of i. If the elements of i are negative, then x[i] selects the elements of x whose subscripts do not match any element in i. If the kth element of i is NA then the kth element of x[i] will be NA as well. i can be any length.

2. If i is a logical vector, it is indexed starting at 1 and those elements of x whose subscripts have a value of T in the corresponding index of i are selected. The same rule as in 1. applies to NAs. For this case, the length of the index vector should equal length(x).

3. If i is a character vector (of any length), the rules are a little bit different. In this case x is required to have what's called a names attribute. A names attribute is a vector of character strings of the same length as x which effectively names each element of x. Assuming that x already has a names attribute, the expression x[c("a","b")] selects the first element of x named a and the first element named b. We will talk more about names when we discuss attributes in general.

Examples:

> x
[1]  3.0  6.0  9.0 10.0  2.2   NA   NA  6.7
> y
[1]  1.0  6.0  9.0  2.0   NA  5.1  0.0 -1.0
> x[3]
[1] 9
> x[1:3]
[1] 3 6 9
> x[-2]
[1]  3.0  9.0 10.0  2.2   NA   NA  6.7
> x[c(F,T,T,F,F,F,F,F)]
[1] 6.0 9.0
> x[x > y]
[1]  3.0 10.0   NA   NA   NA  6.7
> x[!is.na(x)]
[1]  3.0  6.0  9.0 10.0  2.2  6.7
> z <- x[!is.na(x)]    # get rid of missing values

It is instructive to look at the help file for the subsetting operator "[" (type ?"[") and work out some examples. This is a very useful function that you will be using all the time, but it is also easy to get confused and end up selecting values that you didn't mean to select. Try to always check that you have the right vector by using the length function. For a simple example of character indexing, let's create a simple named vector.

> w <- c(cat=1, dog=3, giraffe=11)
> w['cat']
[1] 1
> w[c('cat','giraffe')]
[1] 1 11
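The three kinds of index vector can be compared side by side in a short sketch written for R. The named vector is invented, and the comments restate the rules given above.

```r
x <- c(a = 3, b = 6, c = 9, d = 10)   # invented named vector

x[c(1, 3)]           # positive subscripts select elements 1 and 3
x[-2]                # negative subscripts drop element 2
x[c(0, 1)]           # zeros are dropped, leaving element 1
x[c(1, NA)]          # an NA subscript yields an NA element
x[x > 5]             # logical subscripts keep elements where the test is TRUE
x[c("d", "a")]       # character subscripts select by name, in the order given

length(x[x > 5])     # checking the length guards against surprises: 3
```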

2.5 Matrices, Lists and Data Frames

2.5.1 Matrices

A collection of vectors may represent several different variables in your dataset, but is not the most convenient way of handling your data. We can construct matrices by putting together vectors of the same length and the same mode using the functions cbind and rbind. The first one takes its arguments and puts them together as columns of a matrix, while the second one makes them into the rows of a matrix.

> x1 <- c(2,4,6,8,0)
> x2 <- c(1,3,5,7,9)
> x3 <- c(3,7,11,15,9)
> cx <- cbind(x1,x2,x3)
> rx <- rbind(x1,x2,x3,c(2,6,10,14,8))
> cx
     x1 x2 x3
[1,]  2  1  3
[2,]  4  3  7
[3,]  6  5 11
[4,]  8  7 15
[5,]  0  9  9
> rx
   [,1] [,2] [,3] [,4] [,5]
x1    2    4    6    8    0
x2    1    3    5    7    9
x3    3    7   11   15    9
      2    6   10   14    8

Notice that the columns of cx are labeled and so are the rows of rx except for the last one, since the last argument to rbind was not given a name. Another way to create a matrix is to use the function matrix(data,nrow,ncol,...). This function will read data in a stream from the data argument and put it in a nrow × ncol matrix in column order by default. (In fact only one of nrow and ncol is needed if data is of length nrow*ncol.) The ... represents other arguments that allow the data to be read in row order and labels to be given to rows and columns. A useful function to use with matrices is apply. It is invoked by apply(x,margin,fun,...) where x is a matrix, margin is the dimension over which the function is to be applied (1 for rows, 2 for columns), and fun is the function to be applied to the rows or columns of x.

> apply(cx,2,mean)
x1 x2 x3
 4  5  9

gives us the means of the columns of cx. Actually apply can be used more generally with multidimensional arrays. Other functions related to matrices are dim, dimnames, is.matrix, ncol, nrow and t. t(x) returns the transpose of x. Matrices can be indexed in a similar way to vectors. Usually, our purpose is to select a few columns (variables we want to look at) and rows (observations) satisfying a given condition. Since we have two indexes now, we can look at both:

> cx[2:5,c(2,3)]
     x2 x3
[1,]  3  7
[2,]  5 11
[3,]  7 15
[4,]  9  9
> cx[2:5,c("x2","x3")]
     x2 x3
[1,]  3  7
[2,]  5 11
[3,]  7 15
[4,]  9  9

The second example above shows another way of selecting two particular columns. Since they are named, we can just list their names in the appropriate place in the indexing bracket. If we don't want to impose any restrictions in a particular dimension, we just leave it blank. Thus, cx[,c("x2","x3")] lists all rows of cx for columns x2 and x3. There are, of course, a number of functions to do mathematical operations on matrices: *, %*%, crossprod, and outer, which perform element by element multiplication, matrix product, cross products, and outer products, respectively, on matrices of the appropriate sizes. To most efficiently determine which rows of a matrix x have a column containing an NA, use the expression

is.na(x %*% rep(1,ncol(x)))

To subset the matrix to contain only rows with all non–missing values you can use the Hmisc nomiss function, e.g., nomiss(x).
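The matrix tools from this section can be combined in a short sketch written for R. The data are invented, and the final subsetting line uses plain indexing as a stand-in for the Hmisc nomiss function, so it is an assumption about equivalent behavior rather than a copy of nomiss.

```r
# Fill a 3 x 2 matrix in column order (the default), then in row order
m  <- matrix(1:6, nrow = 3)
mr <- matrix(1:6, nrow = 3, byrow = TRUE)

apply(m, 1, sum)     # row sums: 5 7 9
apply(m, 2, mean)    # column means: 2 5

# Rows containing at least one NA: any NA in a row makes the product NA
x <- cbind(a = c(1, NA, 3), b = c(4, 5, 6))    # invented data
has.na <- as.vector(is.na(x %*% rep(1, ncol(x))))
has.na               # FALSE TRUE FALSE
x[!has.na, ]         # keep only the complete rows
```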

2.5.2 Lists

Lists are collections of objects of different kinds. The components of a list could be vectors, matrices or other lists, and they can have different lengths and types. An example of a list is the names of the rows and columns of a matrix.

> dimnames(cx) <- list(1:5,c("x","y","z"))
> cx
  x y  z
1 2 1  3
2 4 3  7
3 6 5 11
4 8 7 15
5 0 9  9

The function dimnames is used to name the rows and columns of a matrix and its value is required to be a list, so we used the function list to create it. The arguments to list could be anything, and they can be named just as the rows or columns of a matrix.

> list1 <- list(rowmatrix=rx,dimnames(cx),c("a","b","c"))
> list2 <- list(cx,indexes=1:9)
> list1
$rowmatrix:
   [,1] [,2] [,3] [,4] [,5]
x1    2    4    6    8    0
x2    1    3    5    7    9
x3    3    7   11   15    9
x4    2    6   10   14    8

[[2]]:
[[2]][[1]]:
[1] "1" "2" "3" "4" "5"

[[2]][[2]]:
[1] "x" "y" "z"


[[3]]:
[1] "a" "b" "c"

> list2
[[1]]:
  x y  z
1 2 1  3
2 4 3  7
3 6 5 11
4 8 7 15
5 0 9  9

$indexes:
[1] 1 2 3 4 5 6 7 8 9

Components of a list can be selected in one of two ways. The more general method extracts the component by referring to its position in the list: list2[[2]] selects the second component of the list list2. If the components are named, we may select them using the expression list$component or list[['component']]. In the example above, list1$rowmatrix selects the matrix rx. Occasionally, you may need the unlisted results. The function unlist serves just such a purpose. There is virtually no limit to what can be stored in a list, including other lists:

> us <- list(Alabama=list(counties=c('Autauga','Baldwin',
+                           'Barbour','Bibb',...),
+                         pop=4273084,capital='Montgomery'),
+            Alaska=list(counties=c('Aleutians East','Aleutians West',
+                           'Anchorage','Bethel',...),
+                        pop=602545, capital='Juneau'),
+            ...)
> us$Alabama                  # Print information for one state
>                             # same as us[[1]] or us[['Alabama']]
> us$Alabama$counties         # Print counties in Alabama
> us$Alabama$counties[1:5]    # Print first 5 counties
> us[c('Alabama','Alaska')]   # Print a sub-list containing 2 states

Section 2.6.2 provides more information on selecting elements of lists and vectors. You can see that lists provide a natural way to represent hierarchical structures. In the above example we might as well associate some data with the counties, such as the population:
> us <- list(Alabama=list(counties=c(Autauga=40061,Baldwin=123023,
+                           Barbour=26475,Bibb=18142,...),
+                         pop=4273084,capital='Montgomery'),
+            Alaska=list(counties=c('Aleutians East'=2305,
+                           'Aleutians West'=5259,
+                           Anchorage=251336,Bethel=15525,...),
+                        pop=602545, capital='Juneau'),
+            ...)
> # Note: need to enclose non-legal S-Plus object names in quotes
> sum(us$Alabama$counties) - us$Alabama$pop    # should be zero
> us$Alaska$counties['Aleutians East']         # print one county's population
> us$Alaska$counties['Bethel']                 # print another
> us$Alaska$counties[c('Anchorage','Bethel')]  # print two
> Ak <- us$Alaska                              # subset of list for Alaska
> Ak$counties                                  # print Alaska county pops
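Because the listing above contains elided entries (...), it cannot be run as printed. The following toy version, written for R, keeps only two counties per state so that it is self-contained; the county figures are the ones quoted above, and the state populations are omitted because two counties do not add up to a state total.

```r
# Toy nested list: two states, two counties each
us <- list(
  Alabama = list(counties = c(Autauga = 40061, Baldwin = 123023),
                 capital  = "Montgomery"),
  Alaska  = list(counties = c("Aleutians East" = 2305, Bethel = 15525),
                 capital  = "Juneau")
)

us$Alabama$capital                      # "Montgomery"
us[["Alaska"]]$counties["Bethel"]       # select by name at each level
us[[1]]$counties[1]                     # same idea by position: Autauga
names(us)                               # "Alabama" "Alaska"

# Number of counties stored for each state
n.counties <- sapply(us, function(s) length(s$counties))
n.counties
```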

Lists are a very convenient mechanism to summarize in one object all the information related to a particular task. Many functions give as a result a list object. For instance, most modeling functions produce a list whose components are quantities of statistical interest. The function ols in the Design library, for example, fits an ordinary least squares model and returns an object of mode list. Among its components are: the model formula, vector of coefficients, summary of missing values, and, optionally, vectors of predicted values, residuals, and the design matrix and response variable values.
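The same idea can be seen without the Design library. In the sketch below, written for R, the base function lm stands in for ols (a substitution made only so the example is self-contained), and the data are invented.

```r
# Invented data; lm (base R) stands in for ols from the Design library
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(y ~ x, data = d)

mode(fit)               # "list"
names(fit)              # components such as "coefficients" and "residuals"
fit$coefficients        # extract one component with $
length(fit$residuals)   # one residual per observation: 5
```

The components returned by ols differ in detail from those of lm, so names(fit) should be consulted for whichever fitting function is actually used.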

2.5.3 Data Frames

Data frames are a particular kind of list in which all components have the same length. They behave much like matrices in the sense that you can operate on rows and columns and select elements in the same way, except that the components can be of different types. You may have some columns that are character vectors and other columns that are numeric or logical vectors. Moreover, an entire matrix can be part of a data frame, as long as its columns are of the same length as the other components of the data frame. Data frames are the entity in S most similar to a SAS dataset, and they are used most frequently in modeling situations, with rows thought of as observations and columns as variables.

There are several ways to create data frames. First, there is the File ... Import dialog. Second, you can read the data into a data frame from an external ASCII file or SAS dataset by using the functions read.table or sas.get (to be described later). Third, you can construct it from existing objects using the function data.frame.
> obs ← Cs(id1,id2,id3,id4,id5,id6)
> # Hmisc shorthand for c(’id1’,’id2’,’id3’,’id4’,’id5’,’id6’)
> treat ← c(rep("Treatment 1",3),rep("Treatment 2",3))
> treat
[1] "Treatment 1" "Treatment 1" "Treatment 1" "Treatment 2" "Treatment 2"
[6] "Treatment 2"
> x ← c(2.5,3.5,3.0,4.6,5.5,5.3)
> df ← data.frame(treat,x,row.names=obs)
> df
          treat   x
id1 Treatment 1 2.5
id2 Treatment 1 3.5
id3 Treatment 1 3.0
id4 Treatment 2 4.6
id5 Treatment 2 5.5
id6 Treatment 2 5.3
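The same data frame can be built and queried without Hmisc. This is a small sketch in R syntax in which paste0 stands in for Cs (an assumption made so that the example runs with base R only); the name `m` is introduced here for illustration.

```r
obs   <- paste0("id", 1:6)                        # base-R stand-in for Cs(id1,...,id6)
treat <- rep(c("Treatment 1", "Treatment 2"), each = 3)
x     <- c(2.5, 3.5, 3.0, 4.6, 5.5, 5.3)
df    <- data.frame(treat, x, row.names = obs)

dim(df)                    # number of rows and columns
length(df)                 # number of variables, because a data frame is a list
df["id4", ]                # select one observation by row name
df[df$x > 3, "treat"]      # matrix-style subscripting on rows and columns
m <- tapply(df$x, df$treat, mean)   # mean of x within each treatment
print(m)
```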

The argument row.names gives names to the rows of the data frame. If provided, its values must be unique. If it is not provided, S will try to construct it from the arguments to data.frame. For instance, if one of the arguments is a matrix with a dimnames attribute, S will try to use that. If it cannot find any vector from which to construct the row names, it will simply number the rows.

The Hmisc naclus and naplot functions are useful for displaying patterns of NAs in data frames in various ways. naclus also returns a vector containing the number of missing variables for each observation. naclus does this using the statements
na ← sapply(my.data.frame, is.na) * 1
na.per.obs ← apply(na, 1, sum)
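The two statements above can be tried on a toy data frame. The sketch below is in R syntax; the data frame `d` and its values are made up purely for illustration, and `na.per.var` is an extra quantity added here to show the column-wise count.

```r
# Hypothetical data with scattered missing values
d <- data.frame(age = c(30, NA, 45, NA),
                sex = c("m", "f", NA, NA),
                sbp = c(120, 135, NA, 110))

na         <- sapply(d, is.na) * 1   # 0/1 matrix: rows are observations
na.per.obs <- apply(na, 1, sum)      # number of NAs in each observation
na.per.var <- apply(na, 2, sum)      # number of NAs in each variable

print(na.per.obs)
print(na.per.var)
```

Multiplying the logical matrix by 1 converts TRUE/FALSE to 1/0, which is what makes the row and column sums meaningful counts.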


naclus also returns the mean number of other variables that are missing in observations for which variable i is missing, for i = 1, 2, ... . See also the built-in na.pattern function (Section 2.4.2; note that na.pattern does not work correctly for factor variables).

Data frames may be subsetted using the same notation as matrices (see Section 2.5.1).

2.6 Attributes

We have mentioned certain characteristics of S objects that are typical of a particular kind of object, and others that are common to all objects. Among the latter are the length and the mode of an object. Length is easy to describe: it counts the number of elements of a vector or matrix, or the number of major components of a list. As a data frame is also a list whose major components are variables, the length of a data frame is the number of variables it contains. The mode refers to the type of the object, which can be numeric, complex, logical, or character (these are called atomic objects) or list (these are called recursive objects). The functions that report these characteristics are length and mode, respectively.

The other characteristics that describe an object are referred to as its attributes. They include names, dim, dimnames, class, levels, row.names, and any others that you may want to create. Corresponding to each of these attributes there is a function to extract it; thus, to find the dim attribute of the matrix cx, type dim(cx). To find out whether a particular observation is in your data frame, you could use the row.names attribute.
> row.names(df)[row.names(df)=="id9"]
character(0)

The result is a character vector of length zero, meaning that the observation is not in the data frame. Here you could also just print the number of observations with that id using the command sum(row.names(df)==’id9’).

In many cases an attribute determines just what kind of object we have. For instance, a matrix (or more generally, an array) is just a vector with a dim attribute, which allows functions such as apply to act accordingly. Other functions do not make that distinction and will consider it just a vector.
> length(cx)
[1] 15
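The point that a matrix is a vector carrying a dim attribute can be checked directly. In this sketch (R syntax) the matrix `m` is a hypothetical 3 × 5 matrix, chosen only because it has 15 elements like cx; it is not the cx of the text.

```r
m <- matrix(1:15, nrow = 3)    # hypothetical 3 x 5 matrix

attributes(m)                  # the only attribute is dim
length(m)                      # 15: length counts elements and ignores dim
sum(m)                         # sum treats m as a plain vector
row.totals <- apply(m, 1, sum) # apply uses dim to work row by row
print(row.totals)
```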

Attributes can be changed or deleted:
> dim(rx) ← NULL   # or attr(rx,’dim’) ← NULL
> rx
 [1]  2  1  3  2  4  3  7  6  6  5 11 10  8  7 15 14  0  9  9  8
> dim(rx) ← c(5,4)
> rx
     [,1] [,2] [,3] [,4]
[1,]    2    3   11   14
[2,]    1    7   10    0
[3,]    3    6    8    9
[4,]    2    6    7    9
[5,]    4    5   15    8

rx was a 4 × 5 matrix. We first made it into a vector by deleting its dim attribute and then made it into a 5 × 4 matrix by assigning a new one. One could also create a new attribute with the function attr.
> # For Windows use date() to get the current date as a character value
> attr(df,"creation date") ← unix("date") ; attributes(df)
$names:
[1] "treat" "x"

$row.names:
[1] "id1" "id2" "id3" "id4" "id5" "id6"

$class:
[1] "data.frame"

$"creation date":
[1] "Wed Jun 30 10:42:29 EDT 1993"

> names(attributes(df))
[1] "names"         "row.names"     "class"         "creation date"

In this example, the attr function assigns a new attribute called “creation date” to the data frame df, obtaining its value by calling the unix command “date”. Next, we listed all the attributes of df using the function attributes. This not only tells us what attributes df has, but also shows their contents. This might be too much information (especially if you have over a thousand ids in row.names). We can reduce it by typing names(attributes(...)). Notice that the result of attributes is in general a list with named components, which allows us to use the names function on it.
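A platform-independent version of the same idea is sketched below in R syntax. It assumes the base function date() in place of unix("date"), rebuilds df so that the example is self-contained, and introduces the names `has.attr` and `stamp` only for illustration; the order in which R lists attribute names may differ from the S-Plus output shown above.

```r
df <- data.frame(treat = rep(c("Treatment 1", "Treatment 2"), each = 3),
                 x = c(2.5, 3.5, 3.0, 4.6, 5.5, 5.3),
                 row.names = paste0("id", 1:6))

attr(df, "creation date") <- date()     # attach a user-defined attribute
names(attributes(df))                   # attribute names only, not contents
has.attr <- "creation date" %in% names(attributes(df))
stamp    <- attr(df, "creation date")   # fetch the value back

attr(df, "creation date") <- NULL       # assigning NULL deletes the attribute
```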

2.6.1 The Class Attribute and Factor Objects

Notice in the example above that one of the attributes is called class. This is a very special attribute related to the concept of methods. When the class attribute is present, functions will act in different ways depending on the class of the object. For example, plot will act differently if its argument has a class of data.frame. As usual, the class attribute of an object can be extracted using the function class.
> class(df)
[1] "data.frame"
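How a class attribute changes the behavior of a function can be seen with a toy class. The sketch below is in R syntax; the class name "celsius" and its print method are hypothetical and exist only for this illustration.

```r
# A numeric vector tagged with a made-up class
temp <- structure(c(21.5, 19.0), class = "celsius")

# A print method for that class; print() will find it by name
print.celsius <- function(x, ...) {
  cat(paste0(format(unclass(x), nsmall = 1), " C"), sep = "\n")
  invisible(x)
}

class(temp)                 # the class attribute
inherits(temp, "celsius")   # test for membership in a class
print(temp)                 # dispatches to print.celsius
out <- capture.output(print(temp))
```

The generic function print never mentions "celsius"; the method is selected from the class attribute alone, which is the mechanism the paragraph above describes.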

Objects can also be unclassified by means of unclass. The result of using unclass is that df will print as a list rather than as a data frame.
> df
          treat   x
id1 Treatment 1 2.5
id2 Treatment 1 3.5
id3 Treatment 1 3.0
id4 Treatment 2 4.6
id5 Treatment 2 5.5
id6 Treatment 2 5.3

> unclass(df)
$treat:
[1] "Treatment 1" "Treatment 1" "Treatment 1" "Treatment 2" "Treatment 2"
[6] "Treatment 2"
$treat: Levels:
[1] "Treatment 1" "Treatment 2"

$x:
[1] 2.5 3.5 3.0 4.6 5.5 5.3

attr(, "row.names"):
[1] "id1" "id2" "id3" "id4" "id5" "id6"
attr(, "creation date"):
[1] "Wed Jun 30 10:42:29 EDT 1993"


(Note: there is an implicit use of the print function when you type df.)

Of particular interest are the objects of class “factor”. A factor is an object with a discrete set of levels like those that arise from a classification variable. In SAS we could have a variable x taking k different values, say x1, ..., xk, with formatted values l1, ..., lk. In S this will become a factor object with internal numeric codes 1, ..., k and levels l1, ..., lk. The syntax for the factor function is
factor(x,levels,labels,exclude=NA)

x is the vector to be factored, levels is a vector with the unique set of values of x that you want to keep in the factor, and labels is the corresponding set of optional labels for the values of x. Note the very confusing fact that the labels specified to factor will become the levels attribute of the resulting vector. Those elements of x not matching any element of levels will be considered NA. The exclude argument is a vector of values to be excluded from forming levels. For instance, if x is already a vector of character strings, you may want to set exclude to "" to prevent empty strings from becoming a level. If you need to use the internal values of x rather than its levels for some reason, the function unclass comes in handy again.
> x ← c(2,2,2,3,3,3)
> l ← c("2","3")
> f ← factor(x,l)
> x
[1] 2 2 2 3 3 3
> unclass(f)
[1] 1 1 1 2 2 2
attr(, "levels"):
[1] "2" "3"
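The roles of the levels, labels, and exclude arguments are easiest to see side by side. This is a minimal sketch in R syntax; the label names "low" and "high" and the vectors `s` and `g` are made up for illustration.

```r
x <- c(2, 2, 2, 3, 3, 3, 4)

# levels says which values to keep; labels says what to call them
f <- factor(x, levels = c(2, 3), labels = c("low", "high"))
levels(f)        # the labels have become the levels attribute
as.integer(f)    # internal codes; 4 matched no level, so it is NA

# exclude keeps a value from becoming a level
s <- c("a", "", "b", "a")
g <- factor(s, exclude = "")
levels(g)        # the empty string is not a level
is.na(g)         # and the corresponding element is NA
```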

It is not possible to do mathematical transformations of a factor object. The reason is that factors represent categorical variables that may or may not be interval scaled or even ordinal. For example, if x and y are factors, it does not make sense to add them. In summary, a factor is a categorical object with a levels attribute, but which is treated internally as having the values 1:length(levels(x)). If no levels argument is provided, the sorted unique values of x are used.
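Both points in the summary above, the default sorted levels and the refusal to do arithmetic, can be demonstrated briefly. The sketch below shows the behavior in R, which may differ in detail from S-Plus; the vector `y` and the names `codes`, `nums`, and `added` are hypothetical.

```r
y <- factor(c("10", "5", "10", "20"))

levels(y)                            # sorted as character strings, not as numbers
codes <- as.integer(y)               # internal codes index into levels(y)
nums  <- as.numeric(as.character(y)) # recover the numbers through the levels

# Arithmetic on a factor is refused: R returns NAs with a warning
added <- suppressWarnings(y + y)
print(added)
```

Converting with as.numeric(y) alone would return the internal codes rather than the original numbers, which is why the conversion goes through as.character first.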


2.6.2 Summary of Basic Object Types

Table 2.1 summarizes some of the types of objects we have discussed. Note that a factor is a special case of a vector, a matrix is a special case of an array, and a data frame is a special case of a list. The table also describes how elements are selected (subscripted) from an object named x. There, row and col are vectors of positive, negative, or zero–valued integers, logicals, or character strings (strings are allowed when the pertinent dimension of the object x has a names or dimnames attribute). Zero–valued subscripts are ignored, and negative values denote “get all but the subscripts listed, suppressing their signs.” When a subscript is omitted and its place is held by a comma, that means to fetch all elements of the omitted dimension.

For lists and data frames, there are 3 methods for selecting elements. The first of these, x[col], results in a new list or data frame containing the elements (usually variables) corresponding to col. The last 2 methods result in individual variables. There, colname is the name of one of the elements (variables). Below, length is listed as an attribute although it should officially be labeled as a basic property of the object.
Table 2.1: Comparison of Some S Objects

Type                      Description                             Main Attributes
------------------------  --------------------------------------  ----------------------------------------
vector                    single column of numbers (integer,      length: number of elements
x[row]                    single, or double precision) or         names: (optional) names of elements
                          character strings. Usually thought
                          of as a variable

factor                    categorical variable, with categories   length: no. elements
x[row]                    coded as integers 1, 2, 3, ...          names: (optional) names of elements
                                                                  class: ’factor’
                                                                  levels: vector of character strings
                                                                    defining labels that correspond to
                                                                    integer codes

matrix                    rectangular table of numbers or         length: number of rows × number of
x[row,col], x[row,],      character strings                         columns
x[,col]                                                           dim: vector of length 2 containing
                                                                    no. rows, no. columns
                                                                  dimnames: list of length 2 containing
                                                                    a vector of row names (or NULL) and
                                                                    a vector of column names (or NULL)

list                      an arbitrary collection of S objects    length: number of major elements
x[col], x$colname,        including other lists; can be thought   names: names of major elements
x[[’colname’]]            of as a tree; elements do not need to
                          have equal lengths

data frame                a rectangular dataset; a list in which  length: number of variables
x[col], x$colname,        all elements have the same number of    names: names of variables
x[[’colname’]],           rows. Each element in the list is a     class: ’data.frame’
x[row,col], x[row,],      variable, and some of the variables     row.names: row names (observation)
x[,col]                   may be matrices
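The subscripting forms summarized in Table 2.1 can be exercised on small objects. This sketch is in R syntax; all object names and values are hypothetical and chosen only to make the results easy to check by eye.

```r
# Vector: positive, negative, zero, character, and logical subscripts
v <- c(a = 10, b = 20, c = 30, d = 40)
v[c(1, 3)]          # elements 1 and 3
v[-1]               # all but the first
v[c(0, 2)]          # the zero is ignored
v[c("b", "d")]      # by name
v[v > 15]           # by logical condition

# Matrix: an omitted subscript held by a comma means "all"
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("A", "B", "C")))
m["r2", ]           # all columns of row r2
m[, c("A", "C")]    # all rows of two columns
m[-1, 2]            # second column, dropping the first row

# List and data frame: [ ] keeps the container, $ and [[ ]] extract
lst <- list(id = 1:3, tag = "x")
lst["id"]           # a list of length 1
lst[["id"]]         # the vector itself, same as lst$id

d <- data.frame(age = c(30, 40, 50), sex = c("f", "m", "f"))
d["age"]                    # a data frame with one variable
d[d$sex == "f", "age"]      # rows chosen by condition, one column
```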

2.7 When to Quote Constants and Object Names

In S you can use single quotes, double quotes, or the Hmisc Cs function (when the symbols being quoted are legal S names) to specify character strings. Here are some general rules about the use of quotes.

character constants : Character constants should always be quoted when appearing in S programs. Examples:
age[sex==’female’]
dframe[c(’patienta’,’patientb’,’patientc’),]
sex ← ’female’

object names, general : When a data frame or an object naming a vector or matrix is used as the input to a function, do not quote the name. Here are examples:
summary(dframe)
attach(dframe)
summary(varname)
mean(varname)
attach(dframe[dframe$sex==’male’,])
summary(dframe[,c(’age’,’sex’)])

When giving a function the name of an object to create, this name is quoted (see detach in the following item).

data frames : When detaching search position 1 into a data frame, quote its name. E.g., detach(1, ’newdframe’) (failing to quote data frame names in detach is a common problem that causes the search list to be corrupted). Otherwise, data frame names are generally not quoted.

variables : These names are generally unquoted except when used to select columns of a data frame, e.g., dframe[,c(’age’,’sex’)]. If you tried to use dframe[,c(age,sex)], S would combine the values of the age and sex variables and try to use these values as column numbers to retrieve.

list elements : These are generally not quoted (e.g., when used with $) unless their names are not legal variable names. In that case use a statement such as objectname[[’element name’]] or objectname$’element name’.

removing objects : Do not quote object names given to rm (e.g., rm(age, sex, dframe)). Quote a vector of character constants given to remove, e.g., remove(c(’age’, ’sex’)).

get and assign : These functions need object names to be quoted, but not the object representing a value for assign to transmit.

accessing libraries : Library names are unquoted when using the library function. They are quoted when using help().
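Several of the quoting rules above can be checked in one short session. The sketch below is in R syntax; the data frame contents and the names `females`, `newvar`, `doubled`, and `weird` are hypothetical. The detach(1, ...) and remove(c(...)) forms are S-Plus usage and are not exercised here.

```r
# Character constants are quoted; the object being created is not
dframe <- data.frame(age = c(34, 51, 47),
                     sex = c("female", "male", "female"),
                     row.names = c("patienta", "patientb", "patientc"))

summary(dframe)                        # object name as function input: unquoted
dframe[, c("age", "sex")]              # column names used as selectors: quoted
females <- dframe$age[dframe$sex == "female"]   # variable unquoted, constant quoted

# get and assign take the object NAME as a quoted string
assign("newvar", dframe$age * 2)       # the value itself is not quoted
doubled <- get("newvar")

# List element names that are not legal S names must be quoted
weird <- list("element name" = 1)
weird[["element name"]]
weird$"element name"

rm(newvar)                             # rm takes unquoted object names
```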

2.8 Function Libraries

S comes with over 2000 functions, organized in the main system areas and in a library of advanced graphics functions called trellis, as well as other libraries. In Windows S-Plus at least, trellis is automatically available to the user without the need for a library(trellis) command. Other groups of functions supplied with S are organized into libraries that the user must attach using the library function. For example, to get access to advanced matrix functions you can type the command library(Matrix). In version 4.5+ you can use the File ... Load Library pull–down menu to issue the library call. For libraries that need to be loaded early in the search list (i.e., those requiring first=T), check the Attach at top of search list box.

Many users have developed add–on libraries of S functions for UNIX, Windows, or both platforms. Frank Harrell has developed two freely available S libraries for UNIX and Windows that are available in the Statlib archive at lib.stat.cmu.edu or from the UVa web page. The Hmisc library (“Harrell Miscellaneous”) is described in Section 2.9, and the Design library is described in Chapter 9. Once these libraries are installed (footnote 4), get access to their functions and datasets by typing
library(Hmisc, T)    # Reference Hmisc before referencing Design
library(Design,T)    # Design requires Hmisc to work

The T (first=T in expanded notation) is needed because Hmisc and Design override a few builtin functions. Hmisc contains a family of latex functions for converting certain S objects to typeset LaTeX representation. The output of these functions is a text file containing LaTeX code. You can also preview typeset LaTeX files while running S.
4 These functions are built into S-Plus 2000 and later (on Windows only), but they still must be accessed using library() or File ... Load Library.


2.9 The Hmisc Library

The Hmisc library contains around 200 miscellaneous functions useful for such things as data analysis, high–level graphics, utility operations, functions for computing sample size and power, translating SAS datasets into S, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of S objects to LaTeX code, recoding variables, and bootstrap repeated measures analysis. The help categories for Hmisc serve to describe the areas covered by this library:

ANOVA Models
Add to Existing Plot
Bootstrap
Categorical Data
Character Data Operations
Clustering
Computations Related to Plotting
Data Directories
Data Manipulation
Documentation
Grouping Observations
High-Level Plots
Interfaces to Other Languages
Linear Algebra
Logistic Regression Model
Mathematical Operations
Matrices and Arrays
Methods and Generic Functions
Miscellaneous
Multivariate Techniques
Nonparametric Statistics
Overview
Power and Sample Size Calculations
Predictive Accuracy
Printing
Probability Distributions and Random Numbers
Regression
Repeated Measures Analysis
Robust/Resistant Techniques
Sampling
Smoothing Operations
Statistical Inference
Statistical Models
Study Design
Survival Analysis
Utilities


A list of functions in Hmisc along with a brief description follows.

Function Name        Purpose
___________________  _______________________________________________________
abs.error.pred       Computes various indexes of predictive accuracy based
                     on absolute errors, for linear models
approxExtrap         Linear extrapolation
aregImpute           Multiple imputation based on additive regression,
                     bootstrapping, and predictive mean matching
all.is.numeric       Check if character strings are legal numerics
areg.boot            Nonparametrically estimate transformations for both
                     sides of a multiple additive regression, and
                     bootstrap these estimates and R^2
ballocation          Optimum sample allocations in 2-sample proportion test
binconf              Exact confidence limits for a proportion and more
                     accurate (narrower!) score stat.-based Wilson interval
                     (Rollin Brant, mod. FEH)
bootkm               Bootstrap Kaplan-Meier survival or quantile estimates
bpower               Approximate power of 2-sided test for 2 proportions
                     Includes bpower.sim for exact power by simulation
bpplot               Box-Percentile plot
                     (Jeffrey Banfield, umsfjban@bill.oscs.montana.edu)
bsamsize             Sample size requirements for test of 2 proportions
bystats              Statistics on a single variable by levels of >=1 factors
bystats2             2-way statistics
calltree             Calling tree of functions
                     (David Lubinsky, david@hoqax.att.com)
character.table      Shows numeric equivalents of all latin characters
                     Useful for putting many special chars. in graph titles
                     (Pierre Joyet, pierre.joyet@bluewin.ch)
ciapower             Power of Cox interaction test
cleanup.import       More compactly store variables in a data frame, and
                     clean up problem data when e.g. Excel spreadsheet had
                     a nonnumeric value in a numeric column
combine.levels       Combine infrequent levels of a categorical variable
comment              Attach a comment attribute to an object:
                     comment(fit) <- ’Used old data’
                     comment(fit)    # prints comment
confbar              Draws confidence bars on an existing plot using
                     multiple confidence levels distinguished using color
                     or gray scale
contents             Print the contents (variables, labels, etc.) of a
                     data frame
cpower               Power of Cox 2-sample test allowing for noncompliance
Cs                   Vector of character strings from list of unquoted names
csv.get              Enhanced importing of comma separated files labels

cut2                 Like cut with better endpoint label construction and
                     allows construction of quantile groups or groups with
                     given n
datadensity          Snapshot graph of distributions of all variables in a
                     data frame. For continuous variables uses scat1d.
dataRep              Quantify representation of new observations in a
                     database
ddmmmyy              SAS "date7" output format for a chron object
deff                 Kish design effect and intra-cluster correlation
describe             Function to describe different classes of objects.
                     Invoke by saying describe(object). It calls one of
                     the following:
describe.data.frame  Describe all variables in a data frame
                     (generalization of SAS UNIVARIATE)
describe.default     Describe a variable (generalization of SAS UNIVARIATE)
do                   Assists with batch analyses
dot.chart            Dot chart for one or two classification variables
Dotplot              Enhancement of Trellis dotplot allowing for matrix
                     x-var., auto generation of Key function, superposition
drawPlot             Simple mouse-driven drawing program, including a
                     function for fitting Bezier curves
ecdf                 Empirical cumulative distribution function plot
eip                  Edit an object "in-place" (may be dangerous!), e.g.
                     eip(sqrt) will replace the builtin sqrt function
errbar               Plot with error bars (Charles Geyer, U. Chi., mod FEH)
event.chart          Plot general event charts (Jack Lee,
                     jjlee@mdanderson.org, Ken Hess, Joel Dubin;
                     Am Statistician 54:63-70,2000)
event.history        Event history chart with time-dependent cov. status
                     (Joel Dubin, joel.dubin@yale.edu)
find.matches         Find matches (with tolerances) between columns of
                     2 matrices
first.word           Find the first word in an S expression (R Heiberger)
fit.mult.impute      Fit most regression models over multiple transcan
                     imputations, compute imputation-adjusted variances
                     and avg. betas
format.df            Format a matrix or data frame with much user control
                     (R Heiberger and FE Harrell)
ftupwr               Power of 2-sample binomial test using Fleiss, Tytun, Ury
ftuss                Sample size for 2-sample binomial test using Fleiss,
                     Tytun, Ury
                     (Both by Dan Heitjan, dheitjan@biostats.hmc.psu.edu)
gbayes               Bayesian posterior and predictive distributions when
                     both the prior and the likelihood are Gaussian
getHdata             Fetch and list datasets on our web site
gs.slide             Sets nice defaults for graph sheets for S-Plus 4.0 for
                     copying graphs into Microsoft applications
hdquantile           Harrell-Davis nonparametric quantile estimator with s.e.
histbackback         Back-to-back histograms (Pat Burns, Salomon Smith
                     Barney, London, pburns@dorado.sbi.com)


hist.data.frame      Matrix of histograms for all numeric vars. in data frame
                     Use hist.data.frame(data.frame.name)
histSpike            Add high-resolution spike histograms or density
                     estimates to an existing plot
hoeffd               Hoeffding’s D test (omnibus test of independence of
                     X and Y)
impute               Impute missing data (generic method)
%in%                 Find out which elements a are in b : a %in% b
interaction          More flexible version of builtin function
is.present           Tests for non-blank character values or non-NA numeric
                     values
james.stein          James-Stein shrinkage estimates of cell means from
                     raw data
labcurve             Optimally label a set of curves that have been drawn
                     on an existing plot, on the basis of gaps between
                     curves. Also position legends automatically at
                     emptiest rectangle.
label                Set or fetch a label for an S-object
Lag                  Lag a vector, padding on the left with NA or ’’
latex                Convert an S object to LaTeX (R Heiberger & FE Harrell)
ldBands              Lan-DeMets bands for group sequential tests
list.tree            Pretty-print the structure of any data object
                     (Alan Zaslavsky, zaslavsk@hcp.med.harvard.edu)
Load                 Enhanced version of load
mask                 8-bit logical representation of a short integer value
                     (Rick Becker)
matchCases           Match each case on one continuous variable
matxv                Fast matrix * vector, handling intercept(s) and NAs
mem                  mem() types quick summary of memory used during session
mgp.axis             Version of axis() that uses appropriate mgp from
                     mgp.axis.labels and gets around bug in axis(2, ...)
                     that causes it to assume las=1
mgp.axis.labels      Used by survplot and plot in Design library (and other
                     functions in the future) so that different spacing
                     between tick marks and axis tick mark labels may be
                     specified for x- and y-axes. ps.slide, win.slide,
                     gs.slide set up nice defaults for mgp.axis.labels.
                     Otherwise use mgp.axis.labels(’default’) to set
                     defaults. Users can set values manually using
                     mgp.axis.labels(x,y) where x and y are 2nd value of
                     par(’mgp’) to use. Use mgp.axis.labels(type=w) to
                     retrieve values, where w=’x’, ’y’, ’x and y’, ’xy’,
                     to get 3 mgp values (first 3 types) or
                     2 mgp.axis.labels.
minor.tick           Add minor tick marks to an existing plot
mtitle               Add outer titles and subtitles to a multiple plot layout
mulbar.chart         Multiple bar chart for one or two classification
                     variables
%nin%                Opposite of %in%
nomiss               Return a matrix after excluding any row with an NA
panel.bpplot         Panel function for trellis bwplot - box-percentile plots

panel.plsmo          Panel function for trellis xyplot - uses plsmo
pc1                  Compute first prin. component and get coefficients on
                     original scale of variables
plotCorrPrecision    Plot precision of estimate of correlation coefficient
plsmo                Plot smoothed x vs. y with labeling and exclusion of NAs
                     Also allows a grouping variable and plots unsmoothed
                     data
popower              Power and sample size calculations for ordinal responses
                     (two treatments, proportional odds model)
prn                  prn(expression) does print(expression) but titles the
                     output with ’expression’. Do prn(expression,txt) to
                     add a heading (’txt’) before the ’expression’ title
p.sunflowers         Sunflower plots (Andreas Ruckstuhl, Werner Stahel,
                     Martin Maechler, Tim Hesterberg)
ps.slide             Set up postscript() using nice defaults for different
                     types of graphics media
pstamp               Stamp a plot with date in lower right corner (pstamp())
                     Add ,pwd=T and/or ,time=T to add current directory
                     name or time
                     Put additional text for label as first argument, e.g.
                     pstamp(’Figure 1’) will draw ’Figure 1 date’
putKey               Different way to use key()
putKeyEmpty          Put key at most empty part of existing plot
rcorr                Pearson or Spearman correlation matrix with pairwise
                     deletion of missing data
rcorr.cens           Somers’ Dyx rank correlation with censored data
rcorrp.cens          Assess difference in concordance for paired predictors
rcspline.eval        Evaluate restricted cubic spline design matrix
rcspline.plot        Plot spline fit with nonparametric smooth and grouped
                     estimates
rcspline.restate     Restate restricted cubic spline in unrestricted form,
                     and create TeX expression to print the fitted function
recode               Recodes variables
reShape              Reshape a matrix into 3 vectors, reshape serial data
rm.boot              Bootstrap spline fit to repeated measurements model,
                     with simultaneous confidence region - least squares
                     using spline function in time
rMultinom            Generate multinomial random variables with varying prob.
samplesize.bin       Sample size for 2-sample binomial problem
                     (Rick Chappell, chappell@stat.wisc.edu)
sas.get              Convert SAS dataset to S data frame
sasxport.get         Enhanced importing of SAS transport dataset in R
Save                 Enhanced version of save
scat1d               Add 1-dimensional scatterplot to an axis of an existing
                     plot (like bar-codes, FEH/Martin Maechler,
                     maechler@stat.math.ethz.ch/Jens Oehlschlaegel-Akiyoshi,
                     oehl@psyres-stuttgart.de)


score.binary         Construct a score from a series of binary variables or
                     expressions
sedit                A set of character handling functions written entirely
                     in S. sedit() does much of what the UNIX sed program
                     does. Other functions included are substring.location,
                     substring<-, replace.string.wild, and functions to
                     check if a string is numeric or contains only the
                     digits 0-9
setpdf               Adobe PDF graphics setup for including graphics in books
                     and reports with nice defaults, minimal wasted space
setps                Postscript graphics setup for including graphics in
                     books and reports with nice defaults, minimal wasted
                     space. Internally uses psfig function by Antonio
                     Possolo (antonio@atc.boeing.com). setps works with
                     Ghostscript to convert .ps to .pdf
setTrellis           Set Trellis graphics to use blank conditioning panel
                     strips, line thickness 1 for dot plot reference lines:
                     setTrellis(); 3 optional arguments
show.col             Show colors corresponding to col=0,1,...,99
show.pch             Show all plotting characters specified by pch=.
                     Just type show.pch() to draw the table on the current
                     device.
showPsfrag           Use LaTeX to compile, and dvips and ghostview to
                     display a postscript graphic containing psfrag strings
solvet               Version of solve with argument tol passed to qr
somers2              Somers’ rank correlation and c-index for binary y
spearman             Spearman rank correlation coefficient spearman(x,y)
spearman.test        Spearman 1 d.f. and 2 d.f. rank correlation test
spearman2            Spearman multiple d.f. rho^2, adjusted rho^2,
                     Wilcoxon-Kruskal-Wallis test, for multiple predictors
spower               Simulate power of 2-sample test for survival under
                     complex conditions
                     Also contains the Gompertz2,Weibull2,Lognorm2 functions.
spss.get             Enhanced importing of SPSS files using R’s read.spss
                     function
src                  src(name) = source("name.s") with memory
stata.get            Enhanced importing of Stata files using R’s read.dta
                     function
store                store an object permanently (easy interface to assign
                     function)
strmatch             Shortest unique identifier match
                     (Terry Therneau, therneau@mayo.edu)
subset               More easily subset a data frame
substi               Substitute one var for another when observations NA
summarize            Generate a data frame containing stratified summary
                     statistics. Useful for passing to trellis.
summary.formula      General table making and plotting functions for
                     summarizing data
symbol.freq          X-Y Frequency plot with circles’ area prop. to frequency

sys                  Execute unix() or dos() depending on what’s running
tex                  Enclose a string with the correct syntax for using
                     with the LaTeX psfrag package, for postscript graphics.
transace             ace() packaged for easily automatically transforming
                     all variables in a matrix
transcan             automatic transformation and imputation of NAs for a
                     series of predictor variables
trap.rule            Area under curve defined by arbitrary x and y vectors,
                     using trapezoidal rule
trellis.strip.blank  To make the strip titles in trellis more visible, you
                     can make the backgrounds blank by saying
                     trellis.strip.blank(). Use before opening the
                     graphics device.
t.test.cluster       2-sample t-test for cluster-randomized observations
uncbind              Form individual variables from a matrix
units                Set or fetch "units" attribute - units of measurement
                     for var.
upData               Update a data frame (change names, labels, remove vars,
                     etc.)
varclus              Graph hierarchical clustering of variables using
                     squared Pearson or Spearman correlations or Hoeffding D
                     as similarities
                     Also includes the naclus function for examining
                     similarities in patterns of missing values across
                     variables.
xy.group             Compute mean x vs. function of y by groups of x
xYplot               Like trellis xyplot but supports error bars and multiple
                     response variables that are connected as separate lines
win.slide            Setup win.graph or win.printer using nice defaults for
                     presentations/slides/publications
wtd.mean, wtd.var, wtd.quantile, wtd.ecdf, wtd.table, wtd.rank,
wtd.loess.noiter, num.denom.setup
                     Set of functions for obtaining weighted estimates
zoom                 Zoom in on any graphical display
                     (Bill Dunlap, bill@statsci.com)


The web page listed at the front of this document contains several datasets useful in learning about the Hmisc and Design libraries. Two of the data frames are especially useful for learning about logistic modeling with the Design library: titanic and titanic2. Both describe the survival status of individual passengers on the Titanic. The titanic data frame does not contain information from the crew, but it does contain actual ages of half of the passengers.

2.10 Installing Add–on Libraries

For Windows, many of the libraries available in Statlib are distributed as compressed (.zip) files. Installation in this case is trivial, as the user merely needs to unzip (footnote 5) the file into the S-Plus library area (footnote 6). Windows S libraries that call Fortran or C routines (as Hmisc and Design do) are easy to install because the object modules for these routines are stored in a standard format that works on all Windows machines (footnote 7). Therefore the user does not have to have a compiler on her machine.

UNIX users install the libraries using a Makefile which invokes compilers as needed. Some users do not have a Fortran 77 or Fortran 90 compiler on their UNIX system; they have to install such a compiler before installing Hmisc or Design. A Fortran–to–C translator produces code that is too inefficient to be used. Some of the code that needs to be compiled is actually structured Fortran (Ratfor), which needs a Ratfor pre–processor to translate it to Fortran. Users without Ratfor can get pre–processed code already translated to Fortran from FE Harrell (footnote 8).

To install or update the Hmisc or Design library for R, download the appropriate file from http://biostat.mc.vanderbilt.edu/RS (.zip file for Windows; .tar.gz file for Linux/Unix) and store it in a directory for holding temporary files. If using Windows, select the appropriate menu to install/update the package from a local file. If using Linux/Unix, issue a shell command like R CMD INSTALL /tmp/packagename.tar.gz while logged in as superuser. When Hmisc and Design become part of CRAN, they may be installed like other CRAN packages (e.g., by issuing a command like install.packages(’Hmisc’) or update.packages(’Hmisc’) at the R command prompt).

5 Use a recent version of WinZip (from www.winzip.com) or a recent version of unzip that preserves long file names for Windows 95. A good version of unzip is available under Utilities in the Web page listed on the cover of this document. The UVa Web page, under Statistical Computing Tools, has more instructions for installing add–on libraries using WinZip.

2.11 Accessing Add-On Libraries Automatically

As described in more detail in Section 13.6, you can create a special function in your _Data area that is executed each time S is invoked from your project area. The function is called .First. A common use of .First is to do away with the need to issue a library command each time you invoke S. You can define a .First function once and for all by entering statements such as these in a Commands or Script window.

.First <- function() {
  library(Hmisc,T)
  invisible()
}

The invisible function prevents the .First function from printing anything when it is invoked. For R use the command library(Hmisc) instead of library(Hmisc,T). If you create a .First function for R it will be stored in .RData. Because Hmisc has a variety of basic functions that are useful in routine data analysis and because attaching the Hmisc library carries almost no overhead, it can be a good idea to create such a .First function for each project area [9].

[6] E.g., /splus/library, as most .zip files for add-on libraries have been created so that during extraction they will be stored in the correct subdirectory of /splus/library.
[7] Similarly, help files are stored in compiled Microsoft Help format, so these also install easily.
[8] But note that S-Plus comes with a Ratfor pre-processor too.
[9] Hmisc overrides the system subscripting method for factor vectors and date vectors, and it defines functions is.na.dates and is.na.times to check for NAs in date and time vectors. The [.factor redefinition by Hmisc causes by default unused levels to be dropped from the factor vector's levels attribute when the vector is subscripted. This can be overridden by using for example x <- x[,drop=F] or by specifying a system option as follows: options(drop.factor.levels=F).

Chapter 3

Data in S

3.1 Importing Data

If you are using Windows S-Plus, most datasets you will need to analyze will be in a format that can be imported easily using the File ... Import dialog. For example, Excel spreadsheets, text (ASCII) files, and data from other popular statistical software can be converted to S-Plus internal format this way. This method is fast but not all data attributes (e.g., SAS variable labels and value labels) may be imported (see Section 3.2.3). Watch out for non-numeric values in Excel numeric columns, which S-Plus will import as infinity rather than NA. The Hmisc cleanup.import function will change such values to NA as well as set the storage mode of numeric variables to 'single' or 'integer' depending on whether fractional values are present. This cuts storage in half for numeric variables, as S-Plus imports these as double precision variables (16 significant digits). cleanup.import also fixes another problem where numeric variables are mistakenly converted to factors. The Hmisc upData function performs some of the same tasks as cleanup.import in addition to allowing one to change the data frame in many ways (see Section 4.1.5).

The remainder of this chapter deals with commands (functions) for reading and converting data.
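The repair that cleanup.import makes to numeric columns that were mistakenly imported as factors can be illustrated in base R. The following is only a minimal sketch of the idea, not the Hmisc implementation; the vector f and its values are made up for illustration.

```r
# Toy example (hypothetical data): a numeric column that arrived as a
# factor because one cell held a non-numeric entry.
f <- factor(c("1.5", "2", "n/a", "4"))

# as.numeric(f) would return the internal integer codes, so convert the
# level labels to character first; the non-numeric entry becomes NA.
x <- suppressWarnings(as.numeric(as.character(f)))

# If no fractional values are present, integer storage is sufficient.
whole <- all(x == round(x), na.rm = TRUE)
if (whole) storage.mode(x) <- "integer"

print(x)
print(storage.mode(x))
```

Because one value here has a fractional part, the vector stays in double precision; a column of whole numbers would be switched to integer storage.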

3.2 Reading Data into S

3.2.1 Reading Raw Data

The two main functions for reading ASCII datasets into S are scan and read.table. scan is the more versatile of the two, and read.table is easier to use. read.table expects the input data sets to be arranged in tabular form, where the first line may or may not be the variable names. The syntax is
> args(read.table)
function(file, header = F, sep = "", row.names = NULL,
         col.names = paste("V", 1:fields, sep = ""), as.is = F,
         na.strings = "NA")

The first argument is a character string giving the dataset name; header is set to T if the first line of the file contains the variable names; sep is the separator between fields (by default, any number of blanks); the row.names argument can be an already existing vector of the same length as the number of observations or the name of a variable in the dataset. In either case, it should have no duplicates. col.names is used to give names to variables when header is F, and as.is controls which fields are converted to factors. By default, character fields are always made into factor objects. Finally, na.strings can be used to specify character values that are to be treated as missing rather than included as levels of a factor. The result of read.table is a data frame. The function scan is more complicated and we will only give a sketch here.
> args(scan)
function(file = "", what = double(0), n = -1, sep = "", multi.line = F,
         flush = F, append = F, skip = 0, widths = NULL, strip.white = NULL)

The most important arguments here are file and what. The first one is just the name of your dataset, and what is sort of like an INPUT statement. It is a list giving the names and the modes of the data. For example:
> z <- scan("myfile",list(pop=0,city=character()))

In this case, we are reading from the dataset "myfile" the first two columns and naming them pop and city. The 0 after the equal sign in pop only means that it is going to be read as a numeric variable. Any other number or the expression numeric(0) would have had the same effect. Similarly with the character() expression. In S-Plus for Windows you can also read ASCII files using point-and-click methods through the File menu.
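The two functions can be compared side by side on a small inline dataset. This is a sketch for R rather than S-Plus: it assumes R's text= argument (so no file is needed), and the city names and values are made up for illustration. Following the remark above about modes, "" is used for the character field in place of character().

```r
# Toy data (made up for illustration); "." marks a missing value.
txt <- "city pop
Nashville 100
Vienna 200
Boston ."

# read.table: tabular input with a header line; the result is a data frame.
# as.is = TRUE keeps the character field from being converted to a factor.
d <- read.table(text = txt, header = TRUE, na.strings = ".", as.is = TRUE)

# scan: 'what' is a list giving the names and modes of the fields.
# skip = 1 skips the header line; the result is a list, not a data frame.
z <- scan(text = txt, what = list(city = "", pop = 0), skip = 1,
          na.strings = ".", quiet = TRUE)

print(d)
print(z$pop)
```

read.table is the more convenient choice when the input is already rectangular; scan returns a plain list and leaves assembling a data frame to the user.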

3.2.2 Reading S-Plus Data into R

The best way to transport S-Plus vectors, matrices, and data frames to other computers or other versions of S-Plus or to R is to run data.dump() in S-Plus to create a dumpdata-format file (S-Plus transport format or .sdd file), as described in Section 3.5.2. If using S-Plus version 5 or later, use the oldStyle=T option to data.dump. Then convert the object to an R object using code such as the following.
library(foreign)
data.restore('/tmp/my.sdd')   # name of resulting object comes from
                              # original name when my.sdd created

You can read binary S objects in _Data or .Data directories and convert them to R objects in some cases using R’s read.S function in R’s foreign library, if the object was created by S-Plus versions before version 5 (e.g., conversion of S-Plus 2000 binary objects usually works). Here is an example:
library(foreign)
# Print file _Data/___nonfi to see mapping of renamed files
# to object names
newobj <- read.S('_Data/__7')   # must provide a name to hold result

Check the resulting object carefully, because read.S is not foolproof.

3.2.3 Reading SAS Datasets

In many cases, the easiest way to read external files is to read SAS datasets directly. This can be done two ways. First, you can use File ... Import or a standalone database conversion utility such as DBMSCOPY. This approach has the advantages of speed of execution, ease of use, and lack of need of creating temporary ASCII files [1]. There are several disadvantages for either fast import method, however:

(1) They do not carry SAS variable labels into S.
(2) They ignore value labels for categorical variables created using SAS PROC FORMAT.
(3) They do not transport SAS special missing values.
(4) S variable names constructed from SAS names are in all upper case [2].

The sas.get function in the Hmisc library for UNIX or Windows is the other approach to convert SAS datasets. sas.get preserves all SAS data attributes, and if categorical variables have customized FORMATs associated with them, sas.get has several options for defining the category labels to S (typically as factor variables).

Long before converting SAS data to S, you should have prepared the SAS dataset so that it would be as useful as possible in SAS. Then sas.get can also profit from this setup. Here are the relevant points to consider when creating your SAS dataset:

1. Define LABELs on all variables that are not totally self-documenting. The labels should contain mostly lower case letters, as such labels are not only easier to read but they will result in prettier SAS and S output. If you did not take the time to create pretty SAS labels, you can create or override labels after reading the data into S.
2. Use the minimum SAS LENGTH that will store each character or numeric variable. For numeric variables, SAS uses a default of 8 bytes of storage, which is 16 significant digits. Such precision is very seldom needed, and it will result in highly inflated SAS and S datasets. Many SAS variables can be stored as 3 byte floating points, which yields 4 significant digits.
3. Define category level definitions using PROC FORMAT, and associate the formats permanently with the appropriate variables.
4. Don't store dummy variables and other derived variables (e.g., interaction products) in the permanent SAS dataset, and if you do, don't retrieve them into S as S derives such variables on the fly.

If you do not have nice variable labels or category levels set up in SAS, you can always create them or redefine them in S:
sex <- factor(sex, 1:2, c('female','male'))
levels(treatment)[3] <- 'Dextran'
levels(location) <- edit(levels(location))   # edit them interactively
label(location) <- 'Location of last inspection'

[1] The sas.get function has to create temporary ASCII files to do the SAS to S translation.
[2] This can easily be remedied; see Section 3.4.

The Label function, which is documented under the label function, will create a text file containing S code defining the existing labels for all the variables in a data frame. You can edit that code, overriding any labels you don't like (including blank ones) and source that file back into S. Call Label using the syntax Label(dataframename, file=). Omit ,file= to write labels to the command window for copying and pasting into an editor window.

Here is the help file for the Windows version of sas.get. The UNIX version does not have the sasout argument, and there are a few other differences. When one or more of the variables you are rescuing from SAS has a PROC FORMAT format associated with it, it is best to use the recode=T option (the default) when invoking sas.get.

sas.get        Convert a SAS Dataset to an S Dataset        sas.get

Converts a SAS dataset into an S data frame. You may choose to extract only a subset of variables or a subset of observations in the SAS dataset. You may have the function automatically convert PROC FORMAT-coded variables to factor objects. The original SAS codes are stored in an attribute called sas.codes and these may be added back to the levels of a factor variable using the code.levels function. Information about special missing values may be captured in an attribute of each variable having special missing values. This attribute is called special.miss, and such variables are given class special.miss. There are print, [], format, and is.special.miss methods for such variables. The chron function is used to set up date, time, and date-time variables. If a date variable represents a partial date (.5 added if month missing, .25 added if day missing, .75 if both), an attribute partial.date is added to the variable, and the variable also becomes a class imputed variable. The describe function uses information about partial dates and special missing values. There is an option to automatically PKUNZIP compressed SAS datasets.
sas.get works by composing and running a SAS job that creates various ascii files that are read and analyzed by sas.get. You can also run the SAS sas_get macro, which writes the ascii files for downloading, in a separate step or on another computer, and then tell sas.get to access these files instead of running SAS.
sas.get(library, member, variables=<<see below>>, ifs=<<see below>>,
        format.library=library, sasout, formats=F, recode=formats,
        special.miss=F, id=<<see below>>, as.is=.5, check.unique.id=T,
        force.single=F, keep.log=T, log.file="_temp_.log",
        macro=sas.get.macro, clean.up=T, sasprog="sas", where, unzip=F)
is.special.miss(x, code)
x[...]
print(x)
format(x)
sas.codes(x)
x <- code.levels(x)

ARGUMENTS

library: character string naming the directory in which the dataset is kept. The default is library=".", indicating that the current directory is to be used.

member: character string giving the second part of the two part SAS dataset name. (The first part is irrelevant here; it is mapped to the directory name.)

x: a variable that may have been created by sas.get with special.miss=T or with recode in effect.

variables: vector of character strings naming the variables in the SAS dataset. The S dataset will contain only those variables from the SAS dataset. To get all of the variables (the default), an empty string may be given. It is a fatal error if any one of the variables is not in the SAS dataset.

ifs: a vector of character strings, each containing one SAS "subsetting if" statement. These will be used to extract a subset of the observations in the SAS dataset.

format.library: The directory containing the file formats.sc2, which contains the definitions of the user defined formats used in this dataset. By default, we look for the formats in the same directory as the data. The user defined formats must be available (so SAS can read the data).

sasout: If SAS has already run to create the ascii files needed to complete the creation of the S data frame, specify a vector of 4 character strings containing the names of the files (with full path names if the files are not on the current working directory). The files are in the following order: data dictionary, data, formats, special missing values. This is the same order that the file names are specified to the sas_get macro. For files which were not created and hence not applicable, specify "" as the file name. The presence/absence of formats and special missing data files is used to set the formats and special.miss arguments automatically by sas.get.
sasout may also be a character string of length one, in which case it is assumed to be the name of a .zip file, and sas.get automatically runs the DOS PKUNZIP command to extract the component files to the current working directory. The files that are present in the .zip file must have names "dict","data","formats","specmiss" (although "formats" and "specmiss" do not have to be present). When sas.get is finished, these extracted files are automatically deleted. .zip files are useful for downloading large datasets.

formats: Set formats to T to examine the format.library for appropriate formats and store them as the formats attribute of the returned object (see below). A format is used if it is referred to by one or more variables in the dataset, if it contains no ranges of values (i.e., it identifies value labels for single values), and if it is a character format or a numeric format that is not used just to label missing values. If you set recode to T, 1, or 2, formats defaults to T. To fetch the values and labels for variable x in the dataset d you could type:
f <- attr(d$x, "format")
formats <- attr(d, "formats")
formats[[f]]$values; formats[[f]]$labels

recode: This parameter defaults to T if formats is T. If it is T, variables that have an appropriate format (see above) are recoded as factor objects, which map the values to the value labels for the format. Alternatively, set recode to 1 to use labels of the form value:label, e.g. 1:good 2:better 3:best. Set recode to 2 to use labels such as good(1) better(2) best(3). Since sas.codes and code.levels add flexibility, the usual choice for recode is T.

special.miss: For numeric variables, any missing values are stored as NA in S. You can recover special missing values by setting special.miss to T. This will cause the special.miss attribute and the special.miss class to be added to each variable that has at least one special missing value. Suppose that variable y was .E in observation 3 and .G in observation 544. The special.miss attribute for y then has the value
list(codes=c("E","G"),obs=c(3,544))
To fetch this information for variable y you would say for example
s <- attr(y, "special.miss")
s$codes; s$obs
or use is.special.miss(x) or the print.special.miss method, which will replace NA values for the variable with E or G if they correspond to special missing values. The describe function uses this information in printing a data summary.

id: The name of the variable to be used as the row names of the S dataset. The id variable becomes the row.names attribute of a data frame, but the id variable is still retained as a variable in the data frame. You can also specify a vector of variable names as the id parameter. After fetching the data from SAS, all these variables will be converted to character format and concatenated (with a space as a separator) to form a (hopefully) unique ID variable.

as.is: SAS character variables are converted to S factor objects if as.is=F or if as.is is a number between 0 and 1 inclusive and the number of unique values of the variable is less than the number of observations (n) times as.is. The default for as.is is .5, so character variables are converted to factors only if they have fewer than n/2 unique values. The primary purpose of this is to keep unique identification variables as character values in the data frame instead of using more space to store both the integer factor codes and the factor labels.

check.unique.id: If id is specified, the row names are checked for uniqueness if check.unique.id=T. If any are duplicated, a warning is printed. Note that if a data frame is being created with duplicate row names, statements such as my.data.frame["B23",] will retrieve only the first row with a row name of "B23".

force.single: By default, SAS numeric variables having LENGTHs > 4 are stored as S double precision numerics, which allow for the same precision as a SAS LENGTH 8 variable. Set force.single=T to store every numeric variable in single precision (7 digits of precision). This option is useful when the creator of the SAS dataset has failed to use a LENGTH statement.

keep.log: logical flag: if F, delete the SAS log file upon completion.

log.file: the name of the SAS log file.

macro: the name of an S object in the current search path that contains the text of the SAS macro called by S. The S object is a character vector that can be edited using, for example, sas.get.macro <- editor(sas.get.macro).

clean.up: logical flag: if T, remove all temporary files when finished. You may want to keep these while debugging the SAS macro.

sasprog: the name of the system command to invoke SAS.

unzip: set to F by default. Set it to T to automatically invoke the DOS PKUNZIP command if member.zip exists, to uncompress the SAS dataset before proceeding. This assumes you have the file permissions to allow uncompressing in place. If the file is already uncompressed, this option is ignored.

where: by default, a list or data frame which contains all the variables is returned. If you specify where, each individual variable is placed into a separate object (whose name is the name of the variable) using the assign function with the where argument. For example, you can put each variable in its own file in a directory, which in some cases may save memory over attaching a data frame.

code: a special missing value code (A through Z or underscore) to check against. If code is omitted, is.special.miss will return a T for each observation that has any special missing value.
VALUE

A data frame resembling the SAS dataset. If id was specified, that column of the data frame will be used as the row names of the data frame. Each variable in the data frame or vector in the list will have the attributes label and format containing SAS labels and formats. Underscores in formats are converted to periods. Formats for character variables have $ placed in front of their names. If formats is T and there are any appropriate format definitions in format.library, the returned object will have attribute formats containing lists named the same as the format names (with periods substituted for underscores and character formats prefixed by $). Each of these lists has a vector called values and one called labels with the PROC FORMAT; VALUE ... definitions.

SIDE EFFECTS

If a SAS error occurs the SAS log file will be printed under the control of the pager function.

DETAILS

If you specify special.miss=T and there are no special missing values in the SAS dataset, the SAS step will bomb. For variables having a PROC FORMAT VALUE format with some of the levels undefined, sas.get will interpret those values as NA if you are using recode.

If you leave the sasprog argument at its default value of "sas", be sure that the SAS executable is in the PATH specified in your autoexec.bat file. Also make sure that you invoke S so that your current project directory is known to be the current working directory. This is best done by creating a shortcut in Windows 95, for which the command to execute will be something like drive:\spluswin\cmd\splus.exe HOME=. and the program is flagged to start in drive:\myproject for example. In this way, you will be able to examine the SAS log file easily since it will be placed in drive:\myproject by default.

SAS will create SASWORK and SASUSER directories in what it thinks are the current working directories. To specify where SAS should put these instead, edit the config.sas file or specify a sasprog argument of the following form: sasprog="\sas\sas.exe -saswork c:\saswork -sasuser c:\sasuser". When sas.get needs to run SAS it is run in iconized form.

The SAS macro sas_get uses record lengths of up to 4096 in two places. If you are exporting records that are very long (because of a large number of variables and/or long character variables), you may want to edit these LRECLs to quadruple them, for example.

NOTE

If sasout is not given, you must be able to run SAS on your system. If you are reading time or date-time variables, you will need to execute the command library(chron) to print those variables or the data frame.
BACKGROUND

The references cited below explain the structure of SAS datasets and how they are stored. See SAS Language for a discussion of the “subsetting if” statement.
AUTHORS

Frank Harrell, University of Virginia, Terry Therneau, Mayo Clinic, Bill Dunlap, University of Washington and MathSoft.
REFERENCES

SAS Institute Inc. (1990). SAS Language: Reference, Version 6. First Edition. SAS Institute Inc., Cary, North Carolina.
SAS Institute Inc. (1988). SAS Technical Report P-176, Using the SAS System, Release 6.03, under unix Operating Systems and Derivatives. SAS Institute Inc., Cary, North Carolina.
SAS Institute Inc. (1985). SAS Introductory Guide. Third Edition. SAS Institute Inc., Cary, North Carolina.
SEE ALSO

data.frame, describe, impute, chron, print.display, label

EXAMPLE

> mice <- sas.get("saslib", mem="mice", var=c("dose", "strain", "ld50"))
> plot(mice$dose, mice$ld50)
> nude.mice <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
+                      ifs="if strain='nude'")
> nude.mice.dl <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
+                         var=c("dose", "ld50"), ifs="if strain='nude'")
> # Get a dataset from current directory, recode PROC FORMAT; VALUE ...
> # variables into factors with labels of the form "good(1)" "better(2)",
> # get special missing values, recode missing codes .D and .R into new
> # factor levels "Don't know" and "Refused to answer" for variable q1
> d <- sas.get(mem="mydata", recode=2, special.miss=T)
> attach(d)
> nl <- length(levels(q1))
> lev <- c(levels(q1), "Don't know", "Refused")
> q1.new <- as.integer(q1)
> q1.new[is.special.miss(q1,"D")] <- nl+1
> q1.new[is.special.miss(q1,"R")] <- nl+2
> q1.new <- factor(q1.new, 1:(nl+2), lev)
> # Note: would like to use factor() in place of as.integer ... but
> # factor in this case adds "NA" as a category level
> d <- sas.get(mem="mydata", recode=T)
> sas.codes(d$x)   # for PROC FORMATted variables returns original data codes
> d$x <- code.levels(d$x)   # or attach(d); x <- code.levels(x)
> # This makes levels such as "good" "better" "best" into e.g.
> # "1:good" "2:better" "3:best", if the original SAS values were 1,2,3
> # For the following example, suppose that SAS is run on a
> # different machine from the one on which S is run.
> # The sas_get macro is used to create files needed by
> # sas.get (To make a text file containing the sas_get macro
> # run the following S command, for example:
> #   cat(sas.get.macro, file='/sasmacro/sas_get.sas', sep='\n')
> # Here is the SAS job. This job assumes that you put
> # sas_get.sas in an autocall macro library.
> #
> #   libname db '/my/sasdata/area';
> #   %sas_get(db.mydata, dict, data, formats, specmiss,
> #            formats=1, specmiss=1)
> #
> # Substitute whatever file names you may want.
> # Next the 4 files are moved to the S machine (using ascii file
> # transfer mode) and the following S program is run:
> mydata <- sas.get(sasout=c('dict','data','formats','specmiss'),
+                   id='idvar')
> # If PKZIP is run after sas_get, e.g. "PKZIP port dict data formats"
> # (assuming that specmiss was not used here), use
> mydata <- sas.get(sasout='a:port', id='idvar')
> # which will run PKUNZIP port to unzip a:port.zip, creating the
> # dict, data, and formats files which are interpreted (and later
> # deleted) by sas.get

sas.get calls a SAS macro which produces an ASCII dataset and then uses scan to read it into an S object. If there are errors during the SAS macro processing step, the log file is displayed on the screen (unless quiet=T). This way you can usually tell what type of error you have. A common error is to have your dataset in one directory and your formats catalog in another while omitting the format.library argument to sas.get (see the sas.get help file). Another error you may find is the message "file such and such not found". On some systems, this condition may occur if your SAS dataset has not been modified in a while and the system compressed it automatically. Set uncompress=T in this case. Also, if you don't have special missing values, do not set special.miss to T. The sas_get SAS macro specifies the system option NOFMTERR, so if customized formats or format libraries are not found, SAS will proceed as if the offending variables did not have a format associated with them. This works fine when the undefined formats correspond to variables not requested for retrieval. If however you request a variable having a missing format, you may not know about it until you run describe or other functions.
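The as.is rule stated in the sas.get help file above (convert a character variable to a factor if as.is=F, or if the number of unique values is less than n times as.is) can be sketched in a few lines of R. The helper name to_factor_under_as_is and the two vectors are hypothetical and exist only for this illustration; this is not code taken from sas.get.

```r
# Hypothetical helper (not part of Hmisc): would a character vector be
# converted to a factor under the as.is rule described in the help file?
to_factor_under_as_is <- function(x, as.is = 0.5) {
  if (identical(as.is, FALSE)) return(TRUE)   # as.is=F: always convert
  length(unique(x)) < length(x) * as.is       # fewer than n*as.is unique values
}

id  <- c("B21", "B22", "B23", "B24")      # 4 unique values, n = 4
grp <- c("a", "b", "a", "b", "a", "a")    # 2 unique values, n = 6

print(to_factor_under_as_is(id))    # 4 < 2 is FALSE: kept as character
print(to_factor_under_as_is(grp))   # 2 < 3 is TRUE: converted to a factor
```

With the default of .5, identifier-like variables whose values are all distinct stay as character strings, which is the stated purpose of the rule.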

3.2.4 Handling Date Variables in R

R has a comprehensive way of storing and operating on date, time, and date/time values based on POSIX notation. Type ?DateTimeClasses for details. If you import SAS datasets into R using sas.get, SAS date, time, and date/time variables are automatically converted into R's POSIXct variables. If you read date/time fields from ASCII text files, the following example shows how to convert into POSIXct variables. Suppose that a comma separated file test.csv contains the following data:

age,date
21,12/31/02
22,01/01/03
23,1/1/02
24,12/1/02
25,12/1/02
26,

The following program can read and recode the data.
> mydata
  age     date
1  21 12/31/02
2  22 01/01/03
3  23   1/1/02
4  24  12/1/02
5  25  12/1/02
6  26
> d <- mydata$date
> d
[1] 12/31/02 01/01/03 1/1/02   12/1/02  12/1/02
Levels:  01/01/03 1/1/02 12/1/02 12/31/02
> d <- as.POSIXct(strptime(as.character(d),format='%m/%d/%y'))
> # For 4-digit years, use format='%m/%d/%Y'
> # If data were in the format yyyy-mm-dd the conversion would be as
> # simple as d <- as.POSIXct(d)
> d
[1] "2002-12-31 EST" "2003-01-01 EST" "2002-01-01 EST" "2002-12-01 EST"
[5] "2002-12-01 EST" NA

> format(d, '%d%b%Y')
[1] "31Dec2002" "01Jan2003" "01Jan2002" "01Dec2002" "01Dec2002" NA
> # Create a function to make it easy to reformat multiple variables
> dtrans <- function(x, format='%m/%d/%y')
+   as.POSIXct(strptime(as.character(x),format))
>
> mydata$date <- dtrans(mydata$date)
> mydata
  age       date
1  21 2002-12-31
2  22 2003-01-01
3  23 2002-01-01
4  24 2002-12-01
5  25 2002-12-01
6  26       <NA>
> unclass(mydata$date)   # internal values
[1] 1041310800 1041397200 1009861200 1038718800 1038718800         NA
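The listing above starts from an existing mydata object. As a self-contained sketch, the same conversion can be run in R with the file contents supplied inline. Two things here are assumptions rather than part of the original example: read.csv(text=...) stands in for reading test.csv, and tz="UTC" is added so that the result does not depend on the local time zone (the listing above was produced in EST).

```r
# Same contents as test.csv above, supplied inline.
txt <- "age,date
21,12/31/02
22,01/01/03
23,1/1/02
24,12/1/02
25,12/1/02
26,"

mydata <- read.csv(text = txt, stringsAsFactors = FALSE)

# dtrans as in the listing above, with an added tz argument (an assumption
# made here for reproducibility across machines).
dtrans <- function(x, format = "%m/%d/%y", tz = "UTC")
  as.POSIXct(strptime(as.character(x), format, tz = tz))

mydata$date <- dtrans(mydata$date)

# The empty date field in the last record becomes NA.
print(format(mydata$date, "%Y-%m-%d"))
```

Fixing the time zone changes the internal numeric values relative to those shown above but not the calendar dates.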

3.3 Displaying Metadata

The Hmisc contents function displays data about a data frame, including variable labels (if any), units (if any), storage modes, number of NAs, and the number of levels for factor variables. Here is an example.
> contents(pbc)
418 observations and 19 variables    Maximum # NAs:136

         Labels                                 Levels Storage NAs
bili     Serum Bilirubin (mg/dl)                       single    0
albumin  Albumin (gm/dl)                               single    0
stage    Histologic Stage, Ludwig Criteria             single    6
protime  Prothrombin Time (sec.)                       single    2
sex      Sex                                         2 integer   0
fu.days  Time to Death or Liver Transplantation        single    0
age      Age                                           single    0
spiders  Spiders                                     2 integer 106
hepatom  Hepatomagaly                                2 integer 106
ascites  Ascites                                     2 integer 106
alk.phos Alkaline Phosphatase (U/liter)                single  106
sgot     SGOT (U/ml)                                   single  106
chol     Cholesterol (mg/dl)                           single  134
trig     Triglycerides (mg/dl)                         single  136
platelet Platelets (per cm^3/1000)                     single  110
drug     Treatment                                   3 integer   0
status   Follow-up Status                              single    0
edema    Edema                                       3 integer   0
copper   Urine Copper (ug/day)                         single  108
> con <- contents(pbc)
> print(con, sort='names')   # or sort='labels','NAs'

418 observations and 19 variables    Maximum # NAs:136

         Labels                                 Levels Storage NAs
age      Age                                           single    0
albumin  Albumin (gm/dl)                               single    0
alk.phos Alkaline Phosphatase (U/liter)                single  106
ascites  Ascites                                     2 integer 106
. . . .
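The kind of summary that contents reports (number of levels, storage mode, and number of NAs per variable) can be approximated in base R without Hmisc. This is a rough sketch of the idea only, not how contents is implemented; the data frame and the name meta are made up for illustration.

```r
# Toy data frame (made up for illustration).
df <- data.frame(age = c(21, 35, NA),
                 sex = factor(c("female", "male", "female")),
                 stringsAsFactors = FALSE)

# One row of metadata per variable.
meta <- data.frame(
  Levels  = sapply(df, nlevels),                     # 0 for non-factors
  Storage = sapply(df, storage.mode),
  NAs     = sapply(df, function(x) sum(is.na(x))),
  stringsAsFactors = FALSE
)
print(meta)
```

Unlike contents, this sketch does not report variable labels or units, since those are attributes that Hmisc adds.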

3.4 Adjustments to Variables after Input

Whether raw data or a SAS dataset is used to create a data frame, and whether you used a command or a mouse click to import the data, it is frequently the case that variable names, labels, or value codes need adjustment. These items may be easily changed once and for all or they may be changed every time the data frame is "attached" (see Section 4.1.1). To change variable attributes permanently, the recommended approach is to use the Hmisc upData function (Section 4.1.5). But here are some of the basic methods that are available.

For changing individual variables in a list or data frame we rely first on the $ operator for addressing individual variables in a permanent list of variables. This was introduced in Section 2.5.2. The advantage of making permanent changes in the data frame is that all interactive analyses of that data frame will take advantage of all the new variable names and annotations without prefacing the analysis with statements such as those found below.

In S-Plus Version 4.x and 2000 it is easy to change variable names by editing column names on a data sheet, but you will have to re-do this every time the source dataset changes and is in need of re-importing. The following method using the edit function has the same disadvantage but it works in all versions of S-Plus. Suppose that df is the newly created permanent data frame. The names may be edited using
names(df) <- edit(names(df))

or you can change individual names using for example
names(df)[2] <- 'Age'

This changed the name of the second variable on the data frame. Here is a trick for changing all the names to lower case:
names(df) <- casefold(names(df))   # casefold is builtin

Note: When the data are imported from an ASCII file, the best way to specify variable names is to enter them into the "column names" box under the Options tab during the file import operation. To permanently change or define labels for variables, you can use statements such as the following.
label(df$age) <- 'Age in years'
label(df$chol) <- 'Cholesterol (mg%)'

To define or change value labels we use the factor function and the levels attribute (if the variable is already a factor). Suppose that one variable, sex, has values 1 and 2 and that we need to define these as 'female' and 'male', respectively, so that reports and plots will be annotated. Suppose that another variable is already a factor vector, but that we do not like its levels ('a','b','c'). The following statements will fix both problems.
df$sex <- factor(df$sex, 1:2, c('female','male'))
levels(df$treat) <- c('Treatment A','Treatment B','Treatment C')
# This can also be done with the following command
df$treat <- factor(df$treat, c('a','b','c'),
                   c('Treatment A','Treatment B','Treatment C'))
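To try these statements without an imported dataset, a toy data frame can be built first. The sketch below runs in R; the data values and the name treat2 are made up for illustration. It also shows that the two ways of relabelling treat agree when the existing levels are already in the order a, b, c: the levels assignment relabels by position, whereas factor() with explicit levels and labels maps by value.

```r
# Toy data frame (made up for illustration).
df <- data.frame(sex = c(1, 2, 2, 1),
                 treat = factor(c("a", "b", "c", "a")))

df$sex <- factor(df$sex, 1:2, c("female", "male"))

# Mapping by value: each old level is paired with its new label explicitly.
treat2 <- factor(df$treat, c("a", "b", "c"),
                 c("Treatment A", "Treatment B", "Treatment C"))

# Relabelling by position: relies on the current order of the levels.
levels(df$treat) <- c("Treatment A", "Treatment B", "Treatment C")

print(table(df$sex))
print(identical(as.character(df$treat), as.character(treat2)))
```

The factor() form is the safer choice when you are not sure of the order in which the existing levels are stored.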

When a variable is already a factor and you wish to change its levels, you can also use the edit function:
levels(v) <- edit(levels(v))

Sometimes the input data will contain a factor variable having one or more unused levels. You can delete unused levels from the levels attribute of a variable, say x, by typing x <- x[,drop=T]. If the Hmisc library is in effect you merely have to type x <- x[] as Hmisc uses a default value of drop=T for its [.factor factor subsetting method. Other sections show how to define labels and value labels when you only want temporary assignments. This is simpler as you do not need the data frame prefix as in the statements above. You can also attach the data frame in search position one to alleviate the need for the $ prefixing:
attach(df, pos=1, use.names=F)
sex <- factor(sex, 1:2, c('female','male'))
levels(treat) <- ...
label(w3) <- 'A-V area'
detach(1, 'df')

See Section 4.1.1 for more on this point. See Section 4.4 for more details about recoding variables, Section 4.1.3 for how to add new variables, and Section 4.1.4 for how to delete variables. Section 4.5 has a review of the many steps one typically goes through to create ready-to-analyze data frames. See Section 3.1 for more about the cleanup.import function, which can be run on any data frame.
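Returning to the remark on unused factor levels earlier in this section, the sketch below shows the x[,drop=T] idiom in plain R, without Hmisc loaded. The droplevels call is an R-only convenience that is not mentioned in the text; it is included here as an assumption about what an R user might prefer.

```r
x <- factor(c('a', 'b', 'c', 'a'))
x <- x[x != 'c']            # subsetting keeps the now-unused level 'c'
print(levels(x))

y <- x[, drop = TRUE]       # same idiom as in the text: drop unused levels
z <- droplevels(x)          # R-only alternative giving the same levels
print(levels(y))
```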

3.5 Writing Out Data

There are generally two instances in which you want to write output to a file: to produce a printed report (which may be enhanced by using some kind of publishing software), or to produce a dataset which may be shared with other users. In the latter case, especially if the other users are not using S-Plus, the most straightforward way is to use File ... Export or DBMSCOPY or to write an ASCII file. The last approach can be done with the function write.table.

3.5.1 Writing ASCII Files

write.table is very similar to read.table. Its arguments and an example follow.

> args(write.table)
function(data, file = "", sep = ",", append = F, quote.strings = F,
         dimnames.write = T, na = NA, end.of.row = "\n")
> write.table(df,"df.ascii",sep=" ",dimnames.write=F,quote.strings=T)
> !less df.ascii   # escaping to UNIX and using the 'less' pager
> # Could use !notepad df.ascii under Windows
"Treatment 1" 2.5
"Treatment 1" 3.5
"Treatment 1" 3.0
"Treatment 2" 4.6
"Treatment 2" 5.5
"Treatment 2" 5.3
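The argument list shown above is the S-Plus one. In R, write.table has different argument names (quote, row.names, col.names in place of quote.strings and dimnames.write), so a roughly equivalent call might look like the following sketch. The data frame is a hypothetical stand-in built to resemble the listing above, and the file is written to a temporary location.

```r
# Hypothetical data resembling the listing above
df <- data.frame(treat = rep(c('Treatment 1', 'Treatment 2'), each = 3),
                 y = c(2.5, 3.5, 3.0, 4.6, 5.5, 5.3))

out <- tempfile(fileext = '.ascii')
write.table(df, file = out, sep = ' ', quote = TRUE,
            row.names = FALSE, col.names = FALSE)

txt <- readLines(out)       # inspect the raw file instead of using a pager
print(txt)

# Read it back; quoted strings containing blanks survive the round trip
back <- read.table(out, sep = ' ', col.names = c('treat', 'y'))
```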

3.5.2 Transporting S Data

S-Plus stores objects in an internal binary format that is specific to each hardware platform. Fortunately there is an ASCII transport format that can be used to move objects between any two machines. This format is called dumpdata or transport file format. You can write any S-Plus object to a transport file using the data.dump function (see footnote 3), and you can read such files using data.restore. These functions also allow you to write or read a single file containing any number of objects. You can use the File ... Export Data or File ... Import Data dialogs to write or read transport files. When you read, all the objects are created or re-created into search position one.

3.5.3 Customized Printing

The basic function for producing customized output is the cat function. When used in conjunction with other functions like paste, round and format, it can print nicely formatted reports. The basic syntax for cat is cat("character string 1",object,"character string 2"). For example:
> cat("The mean of x is",mean(x))
The mean of x is 4.06666666666667>

Two problems are immediately apparent here: one is that mean(x) is producing too many decimals. The other is that cat is not going to a new line after being executed. To go to a new line, the newline character \n must be included explicitly. To control the number of digits the functions round or format can be used. round(mean(x),3) will round the output of mean(x) to three decimal places, while format(mean(x)) will print mean(x) with as many significant digits as the digits option specifies.
> cat("The mean of x is",round(mean(x),3),"\n")
The mean of x is 4.067
> options()$digits
[1] 7
> options(digits=4)
> cat("The mean of x is",format(mean(x)),"\n")
The mean of x is 4.067
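The difference between rounding to decimal places and limiting significant digits is easy to confuse, so here is a short sketch in R. The vector x is hypothetical, chosen only so that its mean matches the value printed above; signif and the digits argument to format are standard R but are not used in the text.

```r
x <- c(2.5, 3.5, 3.0, 4.6, 5.5, 5.3)   # hypothetical values, mean 4.0666...
m <- mean(x)

cat("round: ", round(m, 3), "\n")             # 3 decimal places
cat("signif:", signif(m, 3), "\n")            # 3 significant digits
cat("format:", format(m, digits = 4), "\n")   # 4 significant digits, as a string
```

Passing digits directly to format avoids changing the session-wide digits option.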

3 To make the result backward compatible, specify oldStyle=T to data.dump when running on S-Plus 5 or 6.

The options function controls some of the system options that are assumed by default such as maximum object size, number of digits, width of a printed line, etc. You can see all the options by typing options(). The result of this action is a list, which is why we typed options()$digits to get the value of just the digits option. The effect of format is to coerce objects to become character strings using a common format. cat prints its arguments in the order in which it encounters them, so, to print something like "value 1 value 2 ... value 10" you would have to type cat("value 1", ... ,"value 10"). The paste function is more efficient for this purpose:
> paste("Value",1:10)
 [1] "Value 1"  "Value 2"  "Value 3"  "Value 4"  "Value 5"  "Value 6"
 [7] "Value 7"  "Value 8"  "Value 9"  "Value 10"

Using cat in conjunction with paste will give us a nicer output
> cat(paste("Value",1:10),fill=8)
Value 1
Value 2
Value 3
Value 4
Value 5
Value 6
Value 7
Value 8
Value 9
Value 10

paste returned a vector of character strings, and cat printed them without the quotation marks. The argument fill instructed cat to start a new line whenever the output would exceed 8 characters. Other arguments to cat include file to send the output to a file that you name, append to control whether cat appends new output to an existing file or overwrites its contents, and sep to insert characters between the arguments to cat in the output. (sep=" " is the default. It can be changed to "" for no spaces.) The print.char.matrix function built into S-Plus is useful for printing hierarchical tables, as it automatically draws boxes separating cells of a table, and each cell can comprise multiple output lines. For R, print.char.matrix is in the Hmisc library.
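The file, append, and sep arguments described above can be illustrated with a brief sketch in R. The temporary file and the strings written to it are made up for the example.

```r
out <- tempfile()

cat("Value", 1, "\n", file = out)                    # creates or overwrites the file
cat("Value", 2, "\n", file = out, append = TRUE)     # adds a second line
cat("a", "b", "c", "\n", file = out, append = TRUE,
    sep = "")                                        # no blanks between arguments

txt <- readLines(out)
print(txt)
```

Omitting append = TRUE on the later calls would leave only the last line in the file.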

3.5.4 Sending Output to a File

You can have S send the output of all commands to a file by using the sink function. cat sends only its own output to a file, while sink sends the results of every command to a file you name (or a command) until you instruct it not to do so.
> sink("myfile")   # Send output to file myfile
> cat("The mean of x is",round(mean(x),3))
> sink()           # Redirect output to the S session
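A slightly fuller sketch in R shows that sink captures the output of every command, not just cat. The vector x and the temporary file name are hypothetical.

```r
x <- c(2.5, 3.5, 3.0, 4.6, 5.5, 5.3)   # hypothetical data
out <- tempfile()

sink(out)                               # start diverting output to the file
cat("The mean of x is", round(mean(x), 3), "\n")
print(summary(x))                       # ordinary printed output is captured too
sink()                                  # restore output to the session

captured <- readLines(out)
print(captured)
```

Forgetting the final sink() is a common source of confusion, since the session then appears to print nothing.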

3.6 Using the Hmisc Library to Inspect Data

Once the data are read into S, the Hmisc library can be helpful in understanding them as well as checking for “holes” and invalid data. Suppose a data frame named w has been created. Here is a suggested program for taking some initial looks. See Section 4.3.3 for more on the sapply function.

w.des <- describe(w)     # save describe() output
page(w.des, multi=T)     # put it in a Window that can linger
win.graph()              # open graphics window - openlook(), motif(), X11() for UNIX
                         # not needed for S-Plus 4.x or later

# First make a dot chart of the number of NAs for each variable,
# sorting variables so that the worst offender is at the top
m <- sapply(w, function(x) sum(is.na(x)))
dotplot(sort(m), xlab='NAs')   # naplot below does this automatically
na.pattern(w)            # gets frequencies of all NA patterns but
                         # treats factor variables as always non-NA

nac <- naclus(w)         # compute all pairwise proportions of missing data
                         # and cluster variables according to similarity of
                         # occurrences of NAs
nac                      # print matrix of pairwise proportions
plot(nac)                # cluster NA patterns graphically
naplot(nac)              # other displays of patterns of NA
hist.data.frame(w)       # matrix of histograms for all non-binary variables
                         # also shows number of NAs
datadensity(w)           # make single graph with strip plots (1-dimensional
                         # scatterplots or rug plots) for all variables in w
                         # also consider using builtin plot(w)
ecdf(w)                  # draw empirical cumulative distributions for all
                         # continuous variables. Also consider using bpplot().

# Now depict how the variables cluster, using squared Spearman rank
# correlation coefficients as similarity measures. varclus uses
# rcorr which does pairwise deletion of NAs
plot(varclus(~ x1 + x2 + x3 + ..., data=w))
# Assumes variables are named x1, x2, x3, ...
# Use plot(varclus(~., data=w)) to analyze all variables

# If any of the variables is missing frequently (say x2), find out what
# predicts its missingness. Use a regression tree
f <- tree(is.na(x2) ~ x1 + x3, data=w)
# Could have used attach(w) to avoid data= above
plot(f, type='uniform')
text(f)

# Other useful functions for more detailed examinations of the data are
# bwplot, bpplot (box-percentile plots), bwplot with panel=panel.bpplot,
# and symbol.freq (for depicting two-way contingency tables).
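The NA-counting step at the start of this program needs nothing beyond base R, so it can be tried without Hmisc. The sketch below uses a hypothetical four-row data frame w; the row-pattern tabulation at the end is only similar in spirit to na.pattern and is not its actual implementation.

```r
# Hypothetical data frame with scattered missing values
w <- data.frame(x1 = c(1, NA, 3, NA), x2 = c(NA, NA, NA, 4), x3 = 1:4)

# Number of NAs per variable, worst offender first
m <- sapply(w, function(x) sum(is.na(x)))
print(sort(m, decreasing = TRUE))

# One string per row: 1 marks a missing value, in column order x1 x2 x3
pat <- apply(is.na(w), 1, function(r) paste(as.integer(r), collapse = ""))
print(table(pat))
```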

See Section 11.3 for information about the ecdf, datadensity, and bpplot functions, and Section 6.1 for information about symbol.freq. See also the builtin function cdf.compare. And don't forget a wonderful built-in function 'plot.data.frame' that nicely displays continuous variables (using CDFs turned sideways) and categorical ones (using frequency dot charts). With a high-resolution printer you can see up to 40 variables clearly on a single page. Here is an example.
par(mfrow=c(5,8))   # allow up to 40 plots per page
plot(w)             # invokes plot.data.frame since w is a data frame
par(mfrow=c(1,1))   # reset to one plot per screen

See Section 11.4 for examples of the use of the trellis library instead of datadensity for drawing "strip plots" for depicting data distributions and data densities stratified by other variables. When you permanently store the result of the describe function (here, in w.des), you can quickly replay it as needed, either by printing it by simply stating its name, or by using page to put it in a new window. If page had already been run with multi=T you merely click on that window's icon to restore it. Note that the page command (see footnote 4) causes the pop-up window to remain after you exit from S-Plus when multi=T. That way you can open the data description whether you are currently in S-Plus or not. In addition to displaying the w.des object, you can easily display any subset of the variables it describes:
w.des[20:30]                 # display description of variables 20-30
page(w.des[c(1:10,30:40)])   # page display variables 1-10, 30-40
w.des[c('age','sex')]        # display 2 variables
w.des$age                    # display single variable

4 This is true for Windows, and for UNIX if you set your pager to be a window utility such as xless. An excellent pager for Windows is the PFE editor described in Section 1.9. You can set this up by typing options(pager='/pfe/pfe32') or clicking on Options ... General Settings ... Computation, for example. Then by using multiple commands of the form page(object,multi=T) you can have PFE manage all of the pager windows, as by default PFE will add new open files when it is called repeatedly, i.e., it will not invoke an entirely new copy of pfe32.exe. Perhaps an even better pager is an Emacs client. In Windows 95/NT you would set this up by using the command options(pager='gnuclient -q').


Chapter 4

Operating in S

4.1 Reading and Writing Data Frames and Variables

In the introduction we created a subdirectory of your working directory called .Data (or _Data) because this allows for more organized data management, and because this is the default location in which S-Plus places new data. This way, all the objects that you create for a particular project are available since S-Plus will search by default in .Data if it exists. However, .Data is not the only directory available to you to store or search for objects. By default, when you start S, a search list is established and a series of directories is accessed sequentially looking for objects or functions. This list can be modified. The function to display the search list is search(). Its purpose is similar to the PATH command in DOS or UNIX. search() will give us a list of all the directories that S searches looking for functions and data.
> library(Hmisc, T)
> library(Design,T)
> search()
 [1] "_Data"
 [2] "D:\\SPLUSWIN\\library\\Design\\_Data"
 [3] "D:\\SPLUSWIN\\library\\hmisc\\_Data"
 [4] "D:\\SPLUSWIN\\splus\\_Functio"
 [5] "D:\\SPLUSWIN\\stat\\_Functio"
 [6] "D:\\SPLUSWIN\\s\\_Functio"
 [7] "D:\\SPLUSWIN\\s\\_Dataset"
 [8] "D:\\SPLUSWIN\\stat\\_Dataset"
 [9] "D:\\SPLUSWIN\\splus\\_Dataset"
[10] "D:\\SPLUSWIN\\library\\trellis\\_Data"

The above search list contains directories, but you can also attach data frames to the list. When a data frame is in the search list, the variables within that data frame are available without using the name of the data frame as a prefix to the variable name.

4.1.1 The attach and detach Functions

To be able to reference objects (data frames, functions, vectors, etc.) that are not in the default search path, you can use the attach function. The main argument to attach is a directory name in single or double quotes or the name of a data frame or list without quotes. As an example, let us attach another directory that contains a variety of S objects. Recall that even in Windows we can specify forward slashes in file and directory names inside of S-Plus. You can also use a backward slash but it must be doubled, as \ is an escape character when inside character strings.
> attach('c:/analyses/support/_Data')
> search()
 [1] "_Data"
 [2] "c:/analyses/support/_Data"
 [3] "D:\\SPLUSWIN\\library\\Design\\_Data"
 [4] "D:\\SPLUSWIN\\library\\hmisc\\_Data"
 [5] "D:\\SPLUSWIN\\splus\\_Functio"
 [6] "D:\\SPLUSWIN\\stat\\_Functio"
 [7] "D:\\SPLUSWIN\\s\\_Functio"
 [8] "D:\\SPLUSWIN\\s\\_Dataset"
 [9] "D:\\SPLUSWIN\\stat\\_Dataset"
[10] "D:\\SPLUSWIN\\splus\\_Dataset"
[11] "D:\\SPLUSWIN\\library\\trellis\\_Data"

Now list the individual objects in /analyses/support/_Data, which is in search position 2. The objects function (a replacement for an older function, ls) will do this.
> objects(2)
 [1] ".First"        ".Last.value"   ".Random.seed"  "backward"
 [5] "combined"      "combphys"      "desc.combined" "dnrprob"
 [9] "last.dump"     "mdemoall"

The objects.summary function will provide a more detailed listing. First let's find out how to call it.
> args(objects.summary)
function(names. = NULL, what = c("data.class", "storage.mode", "extent",
         "object.size", "dataset.date"), where = 1, frame = NULL,
         pattern = NULL, data.class. = NULL, storage.mode. = NULL,
         mode. = "any", all.classes = F, order. = NULL, reverse = F,
         immediate = T)
> objects.summary(where=2)
              data.class storage.mode      extent object.size
       .First   function     function           1         282
  .Last.value   describe         list          14       11904
 .Random.seed    numeric      integer          12          81
     backward data.frame         list    6201 x 9      280180
     combined data.frame         list 10281 x 150     7610275
     combphys data.frame         list 10281 x 166     7025122
desc.combined   describe         list         152      129733
      dnrprob data.frame         list  10281 x 27     1283865
    last.dump       list         list           3         353
     mdemoall data.frame         list   1757 x 14      136496
                dataset.date
       .First  96.04.11 6:28
  .Last.value 97.04.11 10:18
 .Random.seed              0
     backward  96.04.11 6:31
     combined 97.04.08 14:56
     combphys 97.04.11 10:18
desc.combined 97.04.08 15:01
      dnrprob 96.09.17 17:23
    last.dump 97.03.06 14:07
     mdemoall 97.04.11 10:18

For examples to follow we will use the data frames pbc and prostate. You may obtain these from the Vanderbilt Biostatistics web site under Datasets. The file suffixes are .sdd so they may be easily imported as S-Plus transport files using File ... Import. Let us suppose these datasets have already been imported into the current project area's _Data area. If you are using R or a recent version of the Hmisc library (with wget.exe installed if using Windows) you can easily download and access datasets from the Vanderbilt web site using the Hmisc library's getHdata function.
> getHdata(prostate)   # downloads, imports, runs cleanup.import
> find(prostate)
[1] "_Data"

First let's examine the variables in prostate using the describe function in Hmisc. We will first call describe on individual variables. As prostate has not yet been attached, we must prefix its variables with prostate$.
> names(prostate)
 [1] "patno"  "stage"  "rx"     "dtime"  "status" "age"    "wt"     "pf"
 [9] "hx"     "sbp"    "dbp"    "ekg"    "hg"     "sz"     "sg"     "ap"
[17] "bm"     "sdate"
> describe(prostate$age)
prostate$age : Age in Years
   n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
 501       1     41 71.46  56  60  70  73  76  78  80

lowest : 48 49 50 51 52, highest: 84 85 87 88 89
-------------------------------------------------------------------------------
> describe(prostate$rx)
prostate$rx : Treatment
   n missing unique
 502       0      4

placebo (127, 25%), 0.2 mg estrogen (124, 25%), 1.0 mg estrogen (126, 25%)
5.0 mg estrogen (125, 25%)
-------------------------------------------------------------------------------

In this example names(prostate) gave us the variables in the data frame and describe( prostate$age ) and describe( prostate$rx ) some basic statistics on a couple of variables. describe recognizes automatically the type of variable (continuous, categorical (factor), or binary) and gives appropriate descriptive statistics (mean and quantiles, frequency table (see footnote 1), or proportion, respectively). Except for binary variables, the 5 lowest and highest unique values are also given, and for any variable the sample size, number of unique values, and number of missing values are given. When the impute function has been used to impute missing values with "best guesses", describe prints the number of imputed values. When the variable was imported from SAS using sas.get, special missing values were present, and the special.miss option was used, describe will also report the frequency of the various special missing values. Notice that since prostate is a data frame, we are using the $ notation to refer to its components. This can be rather inconvenient and cumbersome. To make things simpler, we can use the attach function to attach the data frame in position one (or two, or whatever) in the search list. By default, attach will place objects (which should be data frames or lists) in position 2. The remaining items move down one position.
> attach(prostate)   # Default placement is search position 2
> search()
[1] "_Data"
[2] "prostate"
[3] "c:/analyses/support/_Data"
[4] "D:\\SPLUSWIN\\library\\Design\\_Data"
[5] "D:\\SPLUSWIN\\library\\hmisc\\_Data"
[6] "D:\\SPLUSWIN\\splus\\_Functio"
. . . .
> describe(age)
age : Age in Years
   n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
 501       1     41 71.46  56  60  70  73  76  78  80

lowest : 48 49 50 51 52, highest: 84 85 87 88 89
-------------------------------------------------------------------------------

When the data frame (or any other recursive object, e.g., a list) is attached to the search list all its components can be accessed directly. This is the case regardless of the position on the search list. The advantage of using position one is that if you have another version of a variable in another data frame or directory in the search list, then you can be sure you are operating on the intended version since the search list is accessed sequentially (i.e., we could have used attach(prostate,pos=1,use.names=F)). However, this will use more memory. If the object is attached in position one, all objects created from now on will be kept in memory and disappear when we quit S-Plus or detach the object unless we instruct it to save them (using for example detach(1, 'prostate')). Keep in mind that for large data frames the attach function may take a while to take effect and it will use a lot of memory. R does not support attaching a data frame in search position one, and at any rate this practice has been found to cause major problems for many programmers, especially those forgetting to detach the data frame upon completion of the modifications to it.
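For R users, the default behaviour of attach (search position 2) can be checked with a short sketch. The three-row data frame d is hypothetical and is unrelated to the prostate dataset.

```r
# Hypothetical data frame
d <- data.frame(age = c(60, 71, 83),
                rx = c('placebo', 'placebo', 'estrogen'))

attach(d)                        # placed in search position 2 by default
pos2 <- search()[2]
mean.age <- mean(age)            # 'age' is found without the d$ prefix
detach(d)                        # detach as soon as the work is finished

still.attached <- "d" %in% search()
print(pos2)
```

Pairing every attach with a detach, as here, avoids the stale-search-list problems mentioned above.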
1 If the variable has more than 20 unique values, the frequency table is omitted.

Another way to make attach use less memory in S-Plus is to specify the use.names=F parameter (see footnote 2). By default, attaching a data frame causes the row.names attribute of the data frame to be copied to each object within the frame, as that object's names attribute. When for example the row.names represent a subject ID, this can be helpful in identifying observations. But this can result in a doubling of memory usage. It is more efficient to associate names with only the variables whose observations you need to identify, or to just reference the row.names. The example below illustrates these.
> attach(titanic, use.names=F)
> record.id <- row.names(titanic)
> names(pclass) <- names(age) <- record.id
> # This isn't so effective here as row.names(titanic) were just
> # record numbers in character form, not passenger names
> # We could have done names(pclass) <- name

The function to take the data frame off the search list is detach. It has two arguments, what and save. what is usually a number denoting a position in the search list and save could be a character string with the name of the object where we will store the (possibly) modified data frame.
> attach(prostate,pos=1,use.names=F)
> ageg50 <- age[age>50]
> length(ageg50)
[1] 497
> sqrt.age <- sqrt(age)
> length(sqrt.age)
[1] 502
> detach(1,save="pros")
Deleted before detaching: ageg50

Here we had the data frame prostate attached in position one. We created two new vectors, ageg50 and sqrt.age. Since ageg50 is shorter than the rest of the variables in the data frame it was deleted before detaching and not added to the new data frame pros.
> names(pros)
 [1] "patno"    "stage"    "rx"       "dtime"    "status"   "age"
 [7] "wt"       "pf"       "hx"       "sbp"      "dbp"      "ekg"
[13] "hg"       "sz"       "sg"       "ap"       "bm"       "sdate"
[19] "sqrt.age"

sqrt.age is a new variable. We could have also said detach(prostate,save=F) which would have deleted sqrt.age before detaching. This form works much faster than trying to save new variables. There is a way to save the value of ageg50 with the data frame by making it into a parametrized data frame. See Spector's book page 37 for an example. Whether it makes any sense to do this is another matter. Also, we question whether it is useful to create easily derived variables such as sqrt.age, as sqrt(age) may be used in any future S expression where age is analyzed. See Section 4.4.3. Because attach modifies the search list, its use is sometimes to be discouraged. In R the with function is an excellent substitute in many contexts. This allows one to reference variables inside a data frame using for example
with(prostate, tapply(age, stage, mean, na.rm=T))

2 R does not have this parameter, and does not put data frame row.names as names attribute of vectors.

Multiple commands may reference variables inside a data frame using for example
with(prostate, {
  ma <- mean(age, na.rm=T)
  fr <- table(stage)
  print(ma)
})

R also allows the analyst to add new variables to a data frame or to recompute existing variables without attach and detach using the transform function.
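A minimal sketch in R of both functions follows, using a hypothetical four-row data frame in place of prostate. It shows with for a grouped summary and transform for adding one variable and recomputing another.

```r
# Hypothetical stand-in for a data frame such as prostate
d <- data.frame(age = c(60, 71, NA, 83), stage = c(3, 3, 4, 4))

# Reference variables without attach: mean age by stage, ignoring NAs
means <- with(d, tapply(age, stage, mean, na.rm = TRUE))
print(means)

# Add a new variable and recompute an existing one in a single call
d <- transform(d, sqrt.age = sqrt(age), stage = factor(stage))
print(names(d))
```

transform returns a modified copy, so the result must be assigned back to d for the change to persist.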

4.1.2 Subsetting Data Frames

In many cases, one analyzes all of the observations and most of the variables in a data frame. If a subset of the data needs to be analyzed for a small part of the job, one can easily process temporary subsets as in the following examples.
plot(age[sex=='male'],height[sex=='male'])
s <- sex=='male'
plot(age[s], height[s])   # equivalent to last example
f <- lrm(death ~ age*height, subset=sex=='male')

When you want to subset the observations or variables in a data frame for an entire sequence of operations, it may be better to subset the entire data frame. You can do this by creating a new data frame using
df.males <- df[df$sex=='male',]

but more typically by attaching a subset of the data frame. Here are several examples. One of them uses the %nin% operator in the Hmisc library, which returns a vector of T and F values according to whether the corresponding element of the first vector is not contained in the second vector. %nin% is the opposite of the %in% operator in Hmisc.
attach(df[,c('age','sex')])    # only make age and sex available - save memory
attach(df[c('age','sex')])     # another way to subset variables using fact
                               # that df is a list in addition to a data frame
attach(df[,Cs(age,sex)])       # use the Cs function in Hmisc to save quoting
attach(df[df$sex=='male',])    # get all variables but only for males
                               # need df$sex instead of sex because attach
                               # hasn't taken effect yet
attach(df[1:100,c(1:2,4:7)])   # get first 100 rows and variables 1,2,4,5,6,7
attach(df[,-4])                # don't get variable number 4
attach(df[,names(df) %nin% c('age','sex')])     # get all but age and sex
attach(df[df$treat %in% c('a','b','d'),
          names(df) %nin% Cs(age,sex)])         # get rows for treatments a,b,d and all but 2 var
attach(df[!(is.na(df$age) | is.na(df$sex)),])   # omit rows containing NAs
attach(df[!is.na(df$age+df$height),])           # shortcut if both vars numeric
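The subsetting expressions inside these attach calls can be checked on their own. The sketch below is plain R with a hypothetical data frame; the local %nin% definition is a stand-in written for this example so that Hmisc is not required, and it is not claimed to be Hmisc's own code.

```r
# Hypothetical data frame
d <- data.frame(age = c(60, NA, 83, 55),
                sex = c('male', 'female', 'male', NA),
                treat = c('a', 'b', 'c', 'd'),
                wt = c(70, 65, 80, 72))

# Local stand-in for Hmisc's %nin%: TRUE where x is not found in table
`%nin%` <- function(x, table) !(x %in% table)

keep.vars <- d[, names(d) %nin% c('age', 'sex')]    # all but age and sex
abd.rows  <- d[d$treat %in% c('a', 'b', 'd'), ]     # treatments a, b, d only
complete  <- d[!(is.na(d$age) | is.na(d$sex)), ]    # no NA in age or sex

print(names(keep.vars))
print(nrow(complete))
```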


After the attach is in effect, referencing any of the included variables will reference the desired subset of rows of the data frame which were attached. In some ways a more elegant approach is to use the Hmisc subset function which is a copy of the R subset function. The advantages of subset are that variable names do not need prefixing by dataframe$, and subset provides an elegant notation for subsetting variables by looking up column numbers corresponding to column names given by the user, which allows consecutive variables to keep or drop to be specified. Here are some examples:
> # Subset a simple vector
> x1 <- 1:4
> sex <- rep(c('male','female'),2)
> subset(x1, sex=='male')
[1] 1 3
> # Subset a data frame
> d <- data.frame(x1=x1, x2=(1:4)/10, x3=(11:14), sex=sex)
> d
  x1  x2 x3    sex
1  1 0.1 11   male
2  2 0.2 12 female
3  3 0.3 13   male
4  4 0.4 14 female
> subset(d, sex=='male')
  x1  x2 x3  sex
1  1 0.1 11 male
3  3 0.3 13 male
> subset(d, sex=='male' & x2>0.2)
  x1  x2 x3  sex
3  3 0.3 13 male
> subset(d, x1>1, select=-x1)
   x2 x3    sex
2 0.2 12 female
3 0.3 13   male
4 0.4 14 female
> subset(d, select=c(x1,sex))
  x1    sex
1  1   male
2  2 female
3  3   male
4  4 female
> subset(d, x2<0.3, select=x2:sex)
   x2 x3    sex
1 0.1 11   male
2 0.2 12 female
> subset(d, x2<0.3, -(x3:sex))
  x1  x2
1  1 0.1
2  2 0.2
> attach(subset(d, sex=='male' & x3==11, x1:x3))

4.1.3 Adding Variables to a Data Frame without Attaching

Attaching your data frame in search position one will allow you to add or change any number of variables. There are other ways to add new variables to an existing data frame if you don't want to have the overhead of attaching it. Suppose that we wish to add two variables, x1 and x2, to an existing data frame called df. Here are two approaches:
df$x1 <- pmax(df$y1, df$y2, df$y3)
df$x2 <- (df$y1 + df$y2 + df$y3) / 3

df <- data.frame(df, x1=pmax(df$y1, df$y2, df$y3),
                 x2=(df$y1 + df$y2 + df$y3)/3)

4.1.4 Deleting Variables from a Data Frame

Setting a variable to the NULL value will cause it to be deleted permanently from the list3 :
df$age <- NULL
df[c('age','sex')] <- NULL   # delete 2 variables
df[Cs(age,sex)] <- NULL      # same thing
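The NULL assignment can be verified on a small example. This is a sketch in R with a hypothetical data frame; Cs is omitted because it requires Hmisc.

```r
# Hypothetical data frame with four variables
df <- data.frame(age = 1:3, sex = c('m', 'f', 'm'),
                 wt = c(70, 60, 80), ht = c(1.8, 1.6, 1.7))

df$ht <- NULL                    # delete a single variable
df[c('age', 'sex')] <- NULL      # delete two variables at once
print(names(df))
```

The rows are untouched; only the named columns disappear from the data frame.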

To remove variables that are inside a data frame currently attached in position 1, use statements such as the following.
age <- NULL
sex <- pressure <- NULL

Do not use rm(varname), remove(’varname’), or remove(’df$varname’) to remove a variable from a data frame. Use one of
