Data Visualization and Fraud Detection - Doing Data Science

At Square they try to maintain reusability and readability by structuring code in different folders with distinct, reusable components that provide semantics around the different parts of building a machine learning model:

Model
    The learning algorithms

Signal
    Data ingestion and feature computation

Error
    Performance estimation

Experiment
    Scripts for exploratory data analysis and experiments

Test
    Test all the things

They only write scripts in the experiments folder where they either tie together components from model, signal, and error, or conduct exploratory data analysis. Each time they write a script, it's more than just a piece of code waiting to rot. It's an experiment that is revisited over and over again to generate insight.
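To make the split concrete, here is a minimal single-file sketch of how an experiment script might tie the other components together. Everything in it is an assumption made for illustration: the synthetic transaction data, the names `load_transactions`, `compute_features`, `ThresholdModel`, `accuracy`, and `run_experiment`, and the choice of a simple amount threshold as the "learning algorithm". It is not Square's code; in a real project each section would live in its own folder and the experiment would import from them.

```python
"""Hypothetical sketch of the signal / model / error / experiment split."""
import random
from statistics import mean


# --- signal: data ingestion and feature computation ---
def load_transactions(n=200, seed=0):
    """Generate synthetic (amount, is_fraud) rows; a stand-in for real ingestion."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        is_fraud = rng.random() < 0.2
        # Assumption for illustration: fraudulent amounts tend to be larger.
        amount = rng.gauss(500.0, 100.0) if is_fraud else rng.gauss(100.0, 50.0)
        rows.append((amount, is_fraud))
    return rows


def compute_features(rows):
    """Split rows into a feature list (amounts) and a label list."""
    return [amount for amount, _ in rows], [label for _, label in rows]


# --- model: the learning algorithm ---
class ThresholdModel:
    """Flags a transaction as fraud when its amount exceeds a learned cutoff."""

    def fit(self, amounts, labels):
        fraud = [a for a, y in zip(amounts, labels) if y]
        legit = [a for a, y in zip(amounts, labels) if not y]
        # The cutoff is the midpoint between the two class means.
        self.threshold = (mean(fraud) + mean(legit)) / 2
        return self

    def predict(self, amounts):
        return [a > self.threshold for a in amounts]


# --- error: performance estimation ---
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# --- experiment: the only place where the pieces are wired together ---
def run_experiment(seed=0):
    rows = load_transactions(seed=seed)
    split = len(rows) // 2
    x_train, y_train = compute_features(rows[:split])
    x_test, y_test = compute_features(rows[split:])
    model = ThresholdModel().fit(x_train, y_train)
    return accuracy(model.predict(x_test), y_test)


if __name__ == "__main__":
    print(f"held-out accuracy: {run_experiment():.3f}")
```

Because the experiment only composes the other pieces, swapping in a different model or error metric means changing one line of `run_experiment` rather than rewriting the script, and the fixed seed lets the same experiment be rerun and compared later.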

What does such a discipline give you? Every time you run an experiment, you should incrementally increase your knowledge. If that's not happening, the experiment is not useful. This discipline helps you make sure you don't do the same work again. Without it, you can't even figure out the things you or someone else has already attempted. Ian further claims that “If you don't write production code, then you're not productive.”

For more on what every project directory should contain, see Project Template by John Myles White. For those students who are using R for their classes, Ian suggests exploring and actively reading GitHub's repository of R code. He says to try writing your own R package, and make sure to read Hadley Wickham's devtools wiki. Also, he says that developing an aesthetic sense for code is analogous to acquiring the taste for beautiful proofs; it's done through rigorous practice and feedback from peers and mentors.

For extra credit, Ian suggests that you contrast the implementations of the caret package with scikit-learn. Which one is more extendable and reusable? Why?
