# import a CSV file of the total annual sales for each customer
sales <- read.csv("c:/data/yearly_sales.csv")
is.data.frame(sales) # returns TRUE
As seen earlier, the variables stored in the data frame can be easily accessed using
the $ notation. The following R code illustrates that in this example, each variable
is a vector with the exception of gender, which was imported as a factor by a
read.csv() default (stringsAsFactors = TRUE in R versions before 4.0.0; later
versions import it as a character vector unless that argument is set). Discussed in
detail later in this section, a factor denotes a categorical variable, typically with
a few finite levels such as “F” and “M” in the case of gender.
length(sales$num_of_orders) # returns 10000 (number of customers)
is.vector(sales$cust_id) # returns TRUE
is.vector(sales$sales_total) # returns TRUE
is.vector(sales$num_of_orders) # returns TRUE
is.vector(sales$gender) # returns FALSE
is.factor(sales$gender) # returns TRUE
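The same checks can be reproduced without the original file. The following is a minimal sketch that assumes a small made-up sample in place of yearly_sales.csv; the inline rows and the names csv_text and sales_demo are illustrative only and do not come from the dataset. It passes stringsAsFactors = TRUE explicitly so that gender becomes a factor regardless of the R version in use.

```r
# Hypothetical three-row sample standing in for yearly_sales.csv
csv_text <- "cust_id,sales_total,num_of_orders,gender
100001,800.6,3,F
100002,217.5,3,F
100003,74.6,2,M"

# Request factor conversion explicitly; the read.csv() default changed in R 4.0.0
sales_demo <- read.csv(text = csv_text, stringsAsFactors = TRUE)

is.factor(sales_demo$gender)   # TRUE
levels(sales_demo$gender)      # "F" "M"
is.vector(sales_demo$gender)   # FALSE: a factor carries class and levels attributes
is.vector(sales_demo$cust_id)  # TRUE: a plain integer vector
```

is.vector() returns FALSE for the factor because it reports TRUE only for vectors that have no attributes other than names.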
Because of their flexibility to handle many data types, data frames are the preferred
input format for many of the modeling functions available in R. The following
use of the str() function provides the structure of the sales data frame. This
function identifies the integer and numeric (double) data types, the factor variables
and levels, as well as the first few values for each variable.
str(sales) # display structure of the data frame object
'data.frame': 10000 obs. of 4 variables:
$ cust_id : int 100001 100002 100003 100004 100005 100006
$ sales_total : num 800.6 217.5 74.6 498.6 723.1 …
$ num_of_orders: int 3 3 2 3 4 2 2 2 2 2 …
$ gender : Factor w/ 2 levels "F","M": 1 1 2 2 1 1 2 2 1 2 …
In the simplest sense, data frames are lists of variables of the same length. A subset
of the data frame can be retrieved through subsetting operators. R's subsetting
operators are powerful in that they allow one to express complex operations in a
succinct fashion and easily retrieve a subset of the dataset.
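A few common subsetting forms can be sketched as follows. This is an illustrative example, not the book's own listing: it assumes a small hypothetical data frame named sales_demo with the same four columns as sales, and the values are made up.

```r
# Hypothetical sample data frame; the values are illustrative only
sales_demo <- data.frame(
  cust_id       = c(100001L, 100002L, 100003L, 100004L),
  sales_total   = c(800.6, 217.5, 74.6, 498.6),
  num_of_orders = c(3L, 3L, 2L, 3L),
  gender        = factor(c("F", "F", "M", "M"))
)

sales_demo[, 2]                         # second column, returned as a numeric vector
sales_demo[1:2, ]                       # first two rows, all columns
sales_demo[, c("cust_id", "gender")]    # two columns selected by name
sales_demo[sales_demo$gender == "F", ]  # rows where gender is "F"

# Combine a logical row filter with a column selection
high <- sales_demo[sales_demo$sales_total > 400, c("cust_id", "sales_total")]
```

The expression before the comma selects rows and the expression after it selects columns; leaving either side empty keeps everything along that dimension.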