3.1 Introduction to R
R is a programming language and software framework for statistical analysis and
graphics. Available for use under the GNU General Public License [1], R software
and installation instructions can be obtained via the Comprehensive R Archive
Network (CRAN) [2]. This section provides an overview of the basic functionality of R. In
later chapters, this foundation in R is utilized to demonstrate many of the presented
analytical techniques.
Before delving into specific operations and functions of R later in this chapter, it
is important to understand the flow of a basic R script that addresses an analytical
problem. The following R code illustrates a typical analytical situation in which a
dataset is imported, the contents of the dataset are examined, and some
model-building tasks are executed. Although the reader may not yet be familiar with
the R syntax, the code can be followed by reading the embedded comments, denoted
by #. In the following scenario, the annual sales in U.S. dollars for 10,000 retail
customers have been provided in the form of a comma-separated-value (CSV) file.
The read.csv() function is used to import the CSV file. This dataset is stored to
the R variable sales using the assignment operator <-.
# import a CSV file of the total annual sales for each customer
sales <- read.csv("c:/data/yearly_sales.csv")
# examine the imported dataset
head(sales)
summary(sales)
# plot num_of_orders vs. sales
plot(sales$num_of_orders,sales$sales_total,
main="Number of Orders vs. Sales")
# perform a statistical analysis (fit a linear regression model)
results <- lm(sales$sales_total ~ sales$num_of_orders)
summary(results)
# perform some diagnostics on the fitted model
# plot histogram of the residuals
hist(results$residuals, breaks = 800)
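Readers who do not have the yearly_sales.csv file at hand can still trace the same import, examine, and model flow. The following is a minimal sketch under stated assumptions: the data are simulated rather than taken from the real customer file, the sample size, slope, and noise level are arbitrary illustrative choices, and only the two column names used in the script above (sales_total and num_of_orders) are reproduced.

```r
# Simulate a stand-in for the customer sales file.
# ASSUMPTION: synthetic data only; n, the slope of 50, and sd = 20 are arbitrary.
set.seed(42)                                   # make the simulation reproducible
n <- 200                                       # hypothetical number of customers
num_of_orders <- sample(1:20, n, replace = TRUE)
sales_total <- 50 * num_of_orders + rnorm(n, mean = 0, sd = 20)
synthetic <- data.frame(sales_total = sales_total,
                        num_of_orders = num_of_orders)

# write the synthetic data to a temporary CSV file, then import it
csv_path <- tempfile(fileext = ".csv")
write.csv(synthetic, csv_path, row.names = FALSE)
sales <- read.csv(csv_path)

# examine the imported dataset
head(sales)
summary(sales)

# fit the linear regression; data = sales lets the formula use bare column names
results <- lm(sales_total ~ num_of_orders, data = sales)
coef(results)
```

Passing the data frame through the data argument fits the same model as the sales$ form used above, and the plot() and hist() calls from the script can be applied to this simulated sales data frame unchanged.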
In this example, the data file is imported using the read.csv() function. Once the
file has been imported, it is useful to examine the contents to ensure that the data