Table 9.3 Example Regular Expressions

Regular Expression  Matches                                               Note
b(P|p)hone          bPhone, bphone                                        Pipe “|” means “or”
bEbo*k              bEbk, bEbok, bEbook, bEboook, bEbooook, bEboooook, …  “*” matches zero or more occurrences of the preceding letter
bEbo+k              bEbok, bEbook, bEboook, bEbooook, bEboooook, …        “+” matches one or more occurrences of the preceding letter
bEbo{2,4}k          bEbook, bEboook, bEbooook                             “{2,4}” matches from two to four repetitions of the preceding letter “o”
^I love             Text starting with “I love”                           “^” matches the start of a string
ACME$               Text ending with “ACME”                               “$” matches the end of a string
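The patterns in Table 9.3 can be tried out directly. The following is a minimal sketch using Python's standard re module; the helper names fully_matches and found_in and the sample strings are assumptions made for illustration, not part of the original text.

```python
import re

def fully_matches(pattern: str, text: str) -> bool:
    """Return True if the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

def found_in(pattern: str, text: str) -> bool:
    """Return True if the pattern matches somewhere in the string."""
    return re.search(pattern, text) is not None

# Alternation: either "P" or "p" may follow the leading "b".
print(fully_matches(r"b(P|p)hone", "bphone"))       # True

# "*" accepts zero o's, whereas "+" requires at least one.
print(fully_matches(r"bEbo*k", "bEbk"))             # True
print(fully_matches(r"bEbo+k", "bEbk"))             # False

# "{2,4}" accepts two to four o's; five is one too many.
print(fully_matches(r"bEbo{2,4}k", "bEbooook"))     # True
print(fully_matches(r"bEbo{2,4}k", "bEboooook"))    # False

# Anchors tie the match to the start ("^") or end ("$") of the string.
print(found_in(r"^I love", "I love my bPhone"))     # True
print(found_in(r"ACME$", "ACME makes the bPhone"))  # False
```

re.fullmatch is used for the product-name patterns so that extra characters around the name cause a mismatch; re.search is used for the anchored patterns because "^" and "$" already pin the match to one end of the text.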
This section has discussed three sources of raw data:
tweets that contain the keywords bPhone or bEbook, related articles from news
portals and blogs, and comments on ACME's products from online shops or
review sites.
If one chooses not to build a data collector from scratch, many companies such as
GNIP [9] and DataSift [10] can provide data collection or data reselling services.
Depending on how the fetched raw data will be used, the Data Science team needs
to be careful during data collection not to violate the rights of the information's
owner or the user agreements of the websites involved. Many websites place
a file called robots.txt in the root directory, that is, http://…/robots.txt
(for example, http://www.amazon.com/robots.txt). It lists the directories
and files that are allowed or disallowed to be visited so that web scrapers or web
crawlers know how to treat the website correctly.
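A data collector can consult these rules before fetching a page. The sketch below uses Python's standard urllib.robotparser module; the robots.txt content, the host www.example.com, and the agent name example-collector are hypothetical values chosen for illustration, and the rules are parsed from a string so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real collector would download the
# file from the root directory of the target website instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Create a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)
AGENT = "example-collector"  # hypothetical crawler name

# Paths outside the disallowed directory may be visited.
print(parser.can_fetch(AGENT, "http://www.example.com/products/bphone"))  # True

# Anything under /private/ is off limits for every user agent.
print(parser.can_fetch(AGENT, "http://www.example.com/private/notes"))    # False
```

Against a live site, set_url followed by read would load the actual file in place of the parse call. Passing the robots.txt check does not by itself settle whether a site's user agreement permits the intended use of the data.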