Exploring Data - Data Science at the Command Line

• Imports required packages

• Loads the CSV file as a data.frame

• Generates a ggplot2 object if needed (more on this in the next section)

• Runs the specified commands

• Prints the result of the last command to standard output

So now, if you want to do one or two things to your data set with R, you can specify them as a one-liner and keep working on the command line. All the knowledge that you already have about R can now be leveraged from the command line. With Rio, you can even create sophisticated visualizations, as you'll see later in this chapter.

Rio doesn't have to be used as a filter, meaning the output doesn't have to be in CSV format per se. You can compute various descriptive statistics:

$ < data/iris.csv Rio -e 'mean(df$sepal_length)'

5.843333

$ < data/iris.csv Rio -e 'sd(df$sepal_length)'

0.8280661

$ < data/iris.csv Rio -e 'sum(df$sepal_length)'

876.5
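To make explicit what these three one-liners compute, here is a minimal sketch in plain awk. It assumes a simple CSV without quoted fields, uses made-up sample data rather than the iris data set, and the helper name col_stats is hypothetical (it is not part of Rio). Like R's sd, it uses the sample standard deviation with an n - 1 denominator.

```shell
# col_stats is a hypothetical helper, not part of Rio.
# It reads a headered CSV on stdin and prints the sum, mean, and sample
# standard deviation (n - 1 denominator, like R's sd) of the named column.
col_stats() {
  awk -F, -v col="$1" '
    NR == 1 {
      # Find the index of the requested column in the header row.
      for (i = 1; i <= NF; i++) if ($i == col) c = i
      next
    }
    { n++; sum += $c; sumsq += $c * $c }
    END {
      # Bail out if the column was not found or there are too few rows.
      if (!c || n < 2) exit 1
      mean = sum / n
      sd = sqrt((sumsq - n * mean * mean) / (n - 1))
      printf "sum=%.4f mean=%.4f sd=%.4f\n", sum, mean, sd
    }'
}

# Made-up sample data, for illustration only.
printf 'x,y\n2,1\n4,3\n4,5\n4,7\n5,9\n5,11\n7,13\n9,15\n' | col_stats x
# prints: sum=40.0000 mean=5.0000 sd=2.1381
```

The sum-of-squares shortcut keeps the sketch short but can lose precision on large or nearly constant columns, which is one reason to prefer R for real work.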

And if you want the five-number summary (minimum, quartiles, and maximum) together with the mean, use summary:

$ < data/iris.csv Rio -e 'summary(df$sepal_length)'

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.300   5.100   5.800   5.843   6.400   7.900
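The quartiles in this output are interpolated rather than picked directly from the data. As a rough sketch of that calculation, the following uses sort and awk with the linear interpolation rule that R's quantile applies by default (type 7). The helper name six_numbers and the sample data are made up for illustration, and the sketch assumes a simple CSV without quoted fields.

```shell
# six_numbers is a hypothetical helper, not part of Rio.
# It prints min, quartiles, mean, and max of one column of a headered CSV,
# interpolating quantiles linearly (the rule R's quantile uses by default).
six_numbers() {
  awk -F, -v col="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    c { print $c }' |
  sort -n |
  awk '
    # Quantile at probability p over the sorted values x[1..n].
    function q(p,   h, lo) {
      h = (n - 1) * p + 1
      lo = int(h)
      if (lo >= n) return x[n]
      return x[lo] + (h - lo) * (x[lo + 1] - x[lo])
    }
    { x[++n] = $1; sum += $1 }
    END {
      if (n < 1) exit 1
      printf "min=%g q1=%g median=%g mean=%g q3=%g max=%g\n",
        x[1], q(0.25), q(0.5), sum / n, q(0.75), x[n]
    }'
}

# Made-up sample data, for illustration only.
printf 'x\n9\n4\n2\n5\n4\n7\n4\n5\n' | six_numbers x
# prints: min=2 q1=4 median=4.5 mean=5 q3=5.5 max=9
```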

You can also compute the skewness (asymmetry of the distribution) and kurtosis (tail heaviness, often described as peakedness, of the distribution), but then you need to have the moments package installed:

$ < data/iris.csv Rio -e 'skewness(df$sepal_length)'

$ < data/iris.csv Rio -e 'kurtosis(df$petal_width)'

Correlation between two features:

$ < data/tips.csv Rio -e 'cor(df$bill, df$tip)'

0.6757341
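For reference, cor computes the Pearson correlation coefficient by default. A minimal awk sketch of the same formula follows. The helper name pearson is hypothetical (not part of Rio), the four data rows are made up rather than taken from the tips data set, and the sketch assumes a simple CSV without quoted fields.

```shell
# pearson is a hypothetical helper, not part of Rio.
# It reads a headered CSV on stdin and prints the Pearson correlation
# between the two named columns.
pearson() {
  awk -F, -v a="$1" -v b="$2" '
    NR == 1 {
      # Locate both columns in the header row.
      for (i = 1; i <= NF; i++) {
        if ($i == a) ca = i
        if ($i == b) cb = i
      }
      next
    }
    {
      n++
      sx += $ca; sy += $cb
      sxx += $ca * $ca; syy += $cb * $cb; sxy += $ca * $cb
    }
    END {
      if (!ca || !cb || n < 2) exit 1
      cov = sxy - sx * sy / n     # n times the covariance
      vx = sxx - sx * sx / n      # n times the variance of a
      vy = syy - sy * sy / n      # n times the variance of b
      printf "%.4f\n", cov / sqrt(vx * vy)
    }'
}

# Made-up sample data, for illustration only.
printf 'bill,tip\n10,1\n20,2\n30,2\n40,5\n' | pearson bill tip
# prints: 0.8944
```

The scaling factors cancel in the ratio, so the result does not depend on whether the population or sample covariance is used.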

Or even a correlation matrix:

$ < data/tips.csv csvcut -c bill,tip | Rio -f cor | csvlook

|--------------------+--------------------|
|  bill              | tip                |
|--------------------+--------------------|
|  1                 | 0.675734109211365  |
|  0.675734109211365 | 1                  |
|--------------------+--------------------|
