# import a CSV file of the total annual sales for each customer
sales <- read.csv("c:/data/yearly_sales.csv")
is.data.frame(sales)
# returns TRUE
As seen earlier, the variables stored in the data frame can be easily accessed using
the $ notation. The following R code illustrates that in this example, each variable
is a vector with the exception of gender, which was imported as a factor by a
read.csv() default. (That default applies to R versions before 4.0.0; from 4.0.0
onward, read.csv() keeps strings as character vectors unless stringsAsFactors = TRUE
is specified.) Discussed in detail later in this section, a factor denotes a
categorical variable, typically with a small, finite set of levels such as “F” and
“M” in the case of gender.
length(sales$num_of_orders)
# returns 10000 (number of customers)
is.vector(sales$cust_id)
# returns TRUE
is.vector(sales$sales_total)
# returns TRUE
is.vector(sales$num_of_orders)
# returns TRUE
is.vector(sales$gender)
# returns FALSE
is.factor(sales$gender)
# returns TRUE
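The checks above can be reproduced without the original file. The following is a minimal sketch, assuming a hypothetical four-row sample written to a temporary file in place of yearly_sales.csv; the names csv_path, sales_sample, and sales_chr are illustrative only. Because the stringsAsFactors default changed to FALSE in R 4.0.0, the sketch sets the argument explicitly rather than relying on the default.

```r
# Hypothetical four-row sample standing in for yearly_sales.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "cust_id,sales_total,num_of_orders,gender",
  "100001,800.64,3,F",
  "100002,217.53,3,F",
  "100003,74.58,2,M",
  "100004,498.60,3,M"
), csv_path)

# Ask for factor conversion explicitly instead of relying on the default
sales_sample <- read.csv(csv_path, stringsAsFactors = TRUE)
is.factor(sales_sample$gender)   # TRUE
levels(sales_sample$gender)      # "F" "M"
is.vector(sales_sample$gender)   # FALSE: a factor carries levels and class attributes

# Without conversion the same column is a plain character vector
sales_chr <- read.csv(csv_path, stringsAsFactors = FALSE)
is.character(sales_chr$gender)   # TRUE
is.vector(sales_chr$gender)      # TRUE
```

Setting stringsAsFactors explicitly keeps the result the same across R versions.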
Because of their flexibility to handle many data types, data frames are the preferred
input format for many of the modeling functions available in R. The following
use of the str() function provides the structure of the sales data frame. This
function identifies the integer and numeric (double) data types, the factor variables
and levels, as well as the first few values for each variable.
str(sales)
# display structure of the data frame object
'data.frame': 10000 obs. of 4 variables:
 $ cust_id      : int 100001 100002 100003 100004 100005 100006 …
 $ sales_total  : num 800.6 217.5 74.6 498.6 723.1 …
 $ num_of_orders: int 3 3 2 3 4 2 2 2 2 2 …
 $ gender       : Factor w/ 2 levels "F","M": 1 1 2 2 1 1 2 2 1 2 …
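The information that str() prints can also be queried column by column. The sketch below assumes a small hand-built data frame, sales_sample, with the same four column names as sales; its values are illustrative, not taken from the real file. It shows that the integers printed for a factor in the str() output are the internal level codes.

```r
# Hypothetical stand-in for the sales data frame
sales_sample <- data.frame(
  cust_id       = c(100001L, 100002L, 100003L, 100004L),
  sales_total   = c(800.64, 217.53, 74.58, 498.60),
  num_of_orders = c(3L, 3L, 2L, 3L),
  gender        = factor(c("F", "F", "M", "M"))
)

# Class of each column: "integer" "numeric" "integer" "factor"
col_classes <- sapply(sales_sample, class)

# Underlying storage type: "integer" "double" "integer" "integer"
col_types <- sapply(sales_sample, typeof)

# A factor stores integer codes that index into its levels
gender_codes <- as.integer(sales_sample$gender)   # 1 1 2 2
gender_levels <- levels(sales_sample$gender)      # "F" "M"
```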
In the simplest sense, data frames are lists of variables of the same length. A subset
of the data frame can be retrieved through subsetting operators. R's subsetting
operators are powerful in that they allow one to express complex operations in a
succinct fashion and easily retrieve a subset of the dataset.
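A few common subsetting forms are sketched below. This is a minimal illustration, assuming a hypothetical four-row data frame named sales_sample with the same columns as sales; the 400 threshold and the result names are arbitrary choices for the example.

```r
# Hypothetical stand-in for the sales data frame
sales_sample <- data.frame(
  cust_id       = c(100001L, 100002L, 100003L, 100004L),
  sales_total   = c(800.64, 217.53, 74.58, 498.60),
  num_of_orders = c(3L, 3L, 2L, 3L),
  gender        = factor(c("F", "F", "M", "M"))
)

# A data frame is a list whose elements all have the same length
is.list(sales_sample)                 # TRUE
col_lengths <- sapply(sales_sample, length)

# Columns by position: all rows, first two columns
first_two_cols <- sales_sample[, 1:2]

# A single column as a vector, equivalent to sales_sample$sales_total
sales_vec <- sales_sample[["sales_total"]]

# Rows selected by a logical condition
female <- sales_sample[sales_sample$gender == "F", ]

# Row condition and column names combined in one expression
big_orders <- sales_sample[sales_sample$sales_total > 400,
                           c("cust_id", "sales_total")]
```

In the bracket form, the expression before the comma selects rows and the expression after it selects columns; leaving either side empty keeps everything along that dimension.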