Jake's Exercise: Naive Bayes for Article Classification
This problem looks at an application of Naive Bayes for multiclass text
classification. First, you will use the New York Times Developer API
to fetch recent articles from several sections of the Times. Then, using
the simple Bernoulli model for word presence, you will implement a
classifier which, given the text of an article from the New York
Times, predicts the section to which the article belongs.
First, register for a New York Times Developer API key and request
access to the Article Search API. After reviewing the API documentation,
write code to download the 2,000 most recent articles for each
of the Arts, Business, Obituaries, Sports, and World sections. (Hint:
Use the nytd_section_facet facet to specify article sections.) The
developer console may be useful for quickly exploring the API. Your code
should save articles from each section to a separate file in a
tab-delimited format, where the first column is the article URL, the second
is the article title, and the third is the body returned by the API.
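The tab-delimited layout described above can be sketched as follows. This is only a minimal illustration of the file format, not of the API calls themselves: the function names `save_articles_tsv` and `load_articles_tsv` and the dictionary keys `url`, `title`, and `body` are hypothetical, and the sketch assumes the articles have already been fetched into a list of dictionaries.

```python
def clean_field(text):
    """Collapse tabs, newlines, and repeated spaces into single spaces.

    This keeps every article on one line and every field in its own column.
    """
    return " ".join((text or "").split())


def save_articles_tsv(path, articles):
    """Write one article per line as: URL <tab> title <tab> body.

    `articles` is assumed to be a list of dicts with the hypothetical
    keys "url", "title", and "body".
    """
    with open(path, "w", encoding="utf-8") as out:
        for article in articles:
            fields = [article.get("url"), article.get("title"), article.get("body")]
            out.write("\t".join(clean_field(f) for f in fields) + "\n")


def load_articles_tsv(path):
    """Read the file back as a list of (url, title, body) tuples."""
    with open(path, encoding="utf-8") as src:
        return [tuple(line.rstrip("\n").split("\t")) for line in src]
```

Cleaning each field before writing matters because article bodies often contain tabs or line breaks, which would otherwise shift columns or split one article across several rows.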
Next, implement code to train a simple Bernoulli Naive Bayes model
using these articles. You can consider documents to belong to one of
$C$ categories, where the label of the $i$th document is encoded as
$y_i \in \{0, 1, 2, \ldots, C\}$ (for example, Arts = 0, Business = 1, etc.)
and documents are represented by the sparse binary matrix $X$, where
$X_{ij} = 1$ indicates that the $i$th document contains the $j$th word in
our dictionary.
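One possible way to build this representation is sketched below. The tokenizer (lowercased runs of letters and apostrophes) and the helper names `tokenize`, `build_dictionary`, and `to_sparse_binary` are assumptions made for illustration; the exercise does not prescribe them. Each row of the sparse binary matrix is stored as the set of column indices $j$ for which the entry equals 1.

```python
import re


def tokenize(text):
    """Return the set of distinct lowercase words in a document.

    A set is enough because the Bernoulli model only records word
    presence, not word counts.
    """
    return set(re.findall(r"[a-z']+", text.lower()))


def build_dictionary(documents):
    """Map every word seen in the corpus to a column index j."""
    vocab = sorted(set().union(*(tokenize(d) for d in documents)))
    return {word: j for j, word in enumerate(vocab)}


def to_sparse_binary(documents, dictionary):
    """Represent each document i as the set of j for which the entry is 1."""
    return [
        {dictionary[w] for w in tokenize(d) if w in dictionary}
        for d in documents
    ]


docs = ["The market fell", "The team won the match"]
dictionary = build_dictionary(docs)
X = to_sparse_binary(docs, dictionary)
```

Storing only the nonzero column indices keeps memory proportional to the number of distinct words per article rather than to the full dictionary size.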
You train by counting words and documents within classes to estimate
$\theta_{jc}$ and $\theta_c$:

$$\theta_{jc} = \frac{n_{jc} + \alpha - 1}{n_c + \alpha + \beta - 2}$$

$$\theta_c = \frac{n_c}{n}$$

where $n_{jc}$ is the number of documents of class $c$ containing the $j$th
word, $n_c$ is the number of documents of class $c$, $n$ is the total number
of documents, and the user-selected hyperparameters $\alpha$ and $\beta$ are
pseudocounts that “smooth” the parameter estimates. Given these
estimates and the words in a document $x$, you calculate the log-odds for
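
The counting step behind the two estimates above can be sketched as follows. This covers training only. It assumes each document is given as a set of word indices and each label as an integer; the function name `train_bernoulli_nb` and the default values `alpha=2.0` and `beta=2.0` (which reduce the smoothed estimate to add-one smoothing) are illustrative choices, not part of the exercise.

```python
from collections import Counter, defaultdict


def train_bernoulli_nb(X, y, num_words, alpha=2.0, beta=2.0):
    """Estimate theta_jc and theta_c by counting.

    X         -- list of sets; X[i] holds the word indices j present in document i
    y         -- list of integer class labels, one per document
    num_words -- size of the dictionary
    alpha, beta -- pseudocounts (defaults here are an assumption)
    """
    n = len(y)                      # total number of documents
    n_c = Counter(y)                # documents per class
    n_jc = defaultdict(Counter)     # per class: documents containing word j
    for row, label in zip(X, y):
        n_jc[label].update(row)     # each word counted once per document

    # theta_c = n_c / n
    theta_c = {c: n_c[c] / n for c in n_c}

    # theta_jc = (n_jc + alpha - 1) / (n_c + alpha + beta - 2)
    theta_jc = {
        c: [
            (n_jc[c][j] + alpha - 1) / (n_c[c] + alpha + beta - 2)
            for j in range(num_words)
        ]
        for c in n_c
    }
    return theta_jc, theta_c


# Three toy documents over a three-word dictionary, two classes.
theta_jc, theta_c = train_bernoulli_nb([{0, 1}, {0}, {2}], [0, 0, 1], num_words=3)
```

With these defaults, a word that never appears in a class still receives a nonzero probability, which avoids taking the logarithm of zero later on.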