Introduction - Data Science at the Command Line

We convert the JSON data to CSV using json2csv and store it as fashion.csv. With wc -l (Rubin & MacKenzie, 2012), we find out that this data set contains 4,855 articles (and not 5,000, probably because we retrieved everything from 2009). The line count reported by wc is one higher because it includes the CSV header row:

$ wc -l fashion.csv
4856 fashion.csv
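To make the off-by-one between lines and articles concrete, here is a minimal sketch that counts both. It does not use fashion.csv itself: the three-row sample file and its timestamp format are hypothetical stand-ins, and the approach assumes that no field contains an embedded newline.

```shell
# Hypothetical three-article sample standing in for fashion.csv.
sample=$(mktemp)
printf '%s\n' \
  'date,type,title' \
  '2009-02-15T00:00:00Z,multimedia,Michael Kors' \
  '2009-02-12T00:00:00Z,multimedia,Alexander Wang' \
  '2009-09-17T00:00:00Z,multimedia,Fashion Week Spring 2010' > "$sample"

# wc -l counts every line, including the header.
# The $(( )) wrapper strips the padding some wc implementations add.
total_lines=$(( $(wc -l < "$sample") ))

# Start at line 2 to skip the header, then count the remaining records.
records=$(( $(tail -n +2 "$sample" | wc -l) ))

echo "lines: $total_lines, records: $records"
rm -f "$sample"
```

Applied to the real file, the same tail and wc combination should report the number of articles rather than the number of lines.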

Let's inspect the first few articles to verify that we have succeeded in obtaining the data. Note that we're applying cols (Janssens, 2014) and cut (Ihnat, MacKenzie, & Meyering, 2012) to the date column in order to leave out the time and time zone information in the table:

$ < fashion.csv cols -c date cut -dT -f1 | head | csvlook

|-------------+------------+-----------------------------------------|
|  date       | type       | title                                   |
|-------------+------------+-----------------------------------------|
|  2009-02-15 | multimedia | Michael Kors                            |
|  2009-02-20 | multimedia | Recap: Fall Fashion Week, New York      |
|  2009-09-17 | multimedia | UrbanEye: Backstage at Marc Jacobs      |
|  2009-02-16 | multimedia | Bill Cunningham on N.Y. Fashion Week    |
|  2009-02-12 | multimedia | Alexander Wang                          |
|  2009-09-17 | multimedia | Fashion Week Spring 2010                |
|  2009-09-14 | multimedia | A Designer Reinvents Himself            |
|-------------+------------+-----------------------------------------|
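If cols is not available, the same date trimming can be approximated with awk. This is only a sketch under a few assumptions: the date is the first column (as in the table above), timestamps use a T between date and time (as the -dT option suggests), and the sample rows with their exact timestamp format are hypothetical.

```shell
# Hypothetical sample; the exact timestamp format is an assumption.
sample=$(mktemp)
printf '%s\n' \
  'date,type,title' \
  '2009-02-20T00:00:00Z,multimedia,"Recap: Fall Fashion Week, New York"' \
  '2009-09-17T00:00:00Z,multimedia,Fashion Week Spring 2010' > "$sample"

# For every line after the header, delete everything from the first "T"
# onward in the first field only. Because the input and output separators
# are both commas, the remaining fields (including quoted commas in titles)
# are reassembled unchanged.
stripped=$(awk 'BEGIN { FS = OFS = "," } NR > 1 { sub(/T.*/, "", $1) } { print }' "$sample")

echo "$stripped"
rm -f "$sample"
```

Unlike cols, this version hardcodes the column position instead of looking it up by name, so it would need adjusting if the date were not the first column.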

That seems to have worked! In order to gain any insight, we'd better visualize the data. Figure 1-3 contains a line graph created with R (R Foundation for Statistical Computing, 2014), Rio (Janssens, 2014), and ggplot2 (Wickham, 2009).

$ < fashion.csv Rio -ge 'g + geom_freqpoly(aes(as.Date(date), color=type), ' \
> 'binwidth=7) + scale_x_date() + labs(x="date", title="Coverage of New York' \
> ' Fashion Week in New York Times")' | display

By looking at the line graph, we can infer that New York Fashion Week happens two times per year. And now we know when: once in February and once in September. Let's hope that it's going to be the same this year so that we can prepare ourselves! In any case, we hope that with this example, we've shown that the New York Times API is an interesting source of data. More importantly, we hope that we've convinced you that the command line can be a very powerful approach for doing data science.
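The seasonal pattern read off the line graph can also be checked as plain counts, without R. The sketch below tallies articles per month using only standard tools. It assumes the date is the first column and already in YYYY-MM-DD form, and it runs on a five-row sample taken from the table above rather than on the full fashion.csv.

```shell
# Small sample built from rows of the table above (time already stripped).
sample=$(mktemp)
printf '%s\n' \
  'date,type,title' \
  '2009-02-15,multimedia,Michael Kors' \
  '2009-02-12,multimedia,Alexander Wang' \
  '2009-02-16,multimedia,Bill Cunningham on N.Y. Fashion Week' \
  '2009-09-17,multimedia,Fashion Week Spring 2010' \
  '2009-09-14,multimedia,A Designer Reinvents Himself' > "$sample"

# Skip the header, keep the YYYY-MM prefix of each line, count how often
# each month occurs, and print the result as "month count".
per_month=$(tail -n +2 "$sample" | cut -c1-7 | sort | uniq -c | awk '{ print $2, $1 }')

echo "$per_month"
rm -f "$sample"
```

On the full data set, months with Fashion Week coverage should stand out with noticeably higher counts, which is the same signal the weekly bins in the line graph show.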

In this section, we've peeked at some important concepts and some exciting command-line tools. Don't worry if some things don't make sense yet. Most of the concepts will be discussed in Chapter 2, and in the subsequent chapters we'll go into more detail for all the command-line tools used in this section.
