8.9.2 Semantic Spam
Semantic spam is the term given to the misuse of Linked Data, or misrepresentation
of information within Linked Data, to direct a semantic search engine or Semantic
Web application to a spammer's data or Web site. When creating your own Linked
Data and linking to other datasets, it is important to be aware of the tricks that could
be used to insert false data. This will help you avoid using such techniques, however
innocently, as semantic search engines will no doubt soon begin to detect and filter
out datasets that employ these methods. This problem is in its infancy, but Ian Davis [22]
has identified a number of semantic spam techniques. These include false labeling,
identity assumption, false provenance, and manipulation of content negotiation.
In false labeling, well-regarded subject URIs are assigned an rdfs:label with
spam content. Since rdfs:labels are often used for human-readable display when
denoting an RDF resource, the spammer's message might well appear prominently in
the Linked Data application in place of the expected label for, say, Tim Berners-Lee's
URI. Spam content can also be inserted as the object of triples involving other predicates
commonly used to hold human-readable content, such as isPrimaryTopicOf or rdfs:seeAlso,
and even the predicate itself can be given a spam rdfs:label. Another line of attack is
identity assumption, in which owl:sameAs is used to link a popular resource to a false
resource that promotes the spam message. Because owl:sameAs is so widespread, many Linked
Data applications use it to aggregate all triples about a subject, so when querying for
all data about dbpedia:London, say, you could find that spam triples are returned as well.
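To make the two attacks above concrete, the following is a minimal sketch rather than a recommended defence. It assumes each statement is stored as a quad (subject, predicate, object, source host), where the source host records which server published the statement. The `aggregate` function, the "a host may only speak for its own URIs" rule, and the spam.example data are all hypothetical and are used only for illustration.

```python
from urllib.parse import urlparse

SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def aggregate(quads, subject, trusted_only=False):
    """Collect triples about `subject` and everything linked to it by owl:sameAs.

    With trusted_only=True (an assumed heuristic, not a standard), a statement
    is kept only if it was published by the host named in its subject URI, and
    owl:sameAs links are followed only outward from an already accepted subject.
    """
    if trusted_only:
        quads = [q for q in quads if urlparse(q[0]).netloc == q[3]]
    aliases = {subject}
    changed = True
    while changed:  # repeat until no new alias is discovered
        changed = False
        for s, p, o, _ in quads:
            if p != SAME_AS:
                continue
            if s in aliases and o not in aliases:
                aliases.add(o)
                changed = True
            elif not trusted_only and o in aliases and s not in aliases:
                aliases.add(s)  # naive mode treats owl:sameAs as symmetric
                changed = True
    return [(s, p, o) for s, p, o, _ in quads if s in aliases and p != SAME_AS]


# Hypothetical data: one genuine label, one identity assumption, two spam labels.
LONDON = "http://dbpedia.org/resource/London"
SPAM = "http://spam.example/London"
quads = [
    (LONDON, RDFS_LABEL, "London", "dbpedia.org"),
    (SPAM, SAME_AS, LONDON, "spam.example"),              # identity assumption
    (SPAM, RDFS_LABEL, "Cheap pills here", "spam.example"),
    (LONDON, RDFS_LABEL, "Buy now", "spam.example"),      # false labeling
]

naive = aggregate(quads, LONDON)
guarded = aggregate(quads, LONDON, trusted_only=True)
```

In this toy data the naive aggregation returns both spam labels alongside the genuine one, while the guarded version keeps only the label published by dbpedia.org. A real application would need a far richer trust model than a host comparison.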
Another opportunity for spammers is false provenance; they attribute their
message to a well-known and trustworthy person, for example, by saying
<http://mereamaps.gov.me/PR/666> a bibo:Quote ;
    bibo:content 'I always drink at the Isis Tavern' ;
    dc:creator 'Tim Berners-Lee' .
This quotation could be displayed by a Linked Data application, along with its attri-
bution, thus misleading consumers. A twist on this misattribution is to state the URI
of a trusted individual or organization as the object of the triple instead of merely the
text “Tim Berners-Lee.”
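One way a consuming application might treat such attributions with caution is sketched below. It assumes, purely for illustration, that statements are stored as quads (subject, predicate, object, source host) and that an attribution counts as corroborated only when the creator's own host publishes a matching foaf:made statement. That corroboration rule, the `unverified_attributions` function, and the example.org-style URIs are hypothetical and are not a technique described by Davis.

```python
from urllib.parse import urlparse

DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
FOAF_MADE = "http://xmlns.com/foaf/0.1/made"


def unverified_attributions(quads):
    """Return (work, creator) pairs whose dc:creator claim is not corroborated.

    A claim is corroborated here only if the creator is given as a URI and the
    host in that URI itself published `creator foaf:made work` (an assumption).
    """
    confirmed = {
        (o, s)
        for s, p, o, source in quads
        if p == FOAF_MADE and urlparse(s).netloc == source
    }
    suspects = []
    for s, p, o, _ in quads:
        if p != DC_CREATOR:
            continue
        is_uri = o.startswith(("http://", "https://"))
        if not is_uri or (s, o) not in confirmed:
            suspects.append((s, o))
    return suspects


# Hypothetical data: a literal attribution, a URI attribution, a confirmed one.
QUOTE = "http://mereamaps.gov.me/PR/666"
ALICE = "http://trusted.example/people/alice#me"
ESSAY = "http://blog.example/essays/1"
quads = [
    (QUOTE, DC_CREATOR, "Tim Berners-Lee", "mereamaps.gov.me"),
    (QUOTE, DC_CREATOR, ALICE, "mereamaps.gov.me"),
    (ESSAY, DC_CREATOR, ALICE, "blog.example"),
    (ALICE, FOAF_MADE, ESSAY, "trusted.example"),
]

suspects = unverified_attributions(quads)
```

Both attributions of the quotation are flagged, whether the creator is given as plain text or as a URI, because neither is backed by the named creator's own site. Only the essay, which the creator's host also claims, passes.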
Another trick outlined by Ian Davis is to supply useful Linked Data to software agents
but spam messages to humans, by manipulating the content negotiation process. A Linked
Data application and a Web browser send different HTTP requests for the same URI: the
application typically asks for RDF in its Accept header, whereas the browser asks for
human-readable HTML, so the spam server can send different content to each. It is
therefore particularly important for you, as a well-regarded, nonspamming publisher, to
supply the same semantics in your Linked Data as in your human-readable content, since
antispam filters are likely to find a way to test for this before long. (This
recommendation is similar to the best practice guideline on the traditional Web of not
hiding any data or links from the user.)
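As a publisher, you could run a rough self-check along the following lines. This is only a sketch under stated assumptions: the helper names (`fetch`, `page_text`, `literals_missing_from_html`) are hypothetical, the comparison is a crude substring match rather than a true semantic comparison, and the RDF literals are assumed to have been extracted already by whatever RDF parser you use.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch(uri, accept):
    """GET `uri` with the given Accept header and return the body as text."""
    request = Request(uri, headers={"Accept": accept})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class _TextCollector(HTMLParser):
    """Collects all character data in a page (including script/style text)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def page_text(html):
    """Return the page's character data, whitespace-normalized and lowercased."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(" ".join(collector.parts).split()).lower()


def literals_missing_from_html(literals, html):
    """List RDF literal values that do not appear anywhere in the HTML text."""
    text = page_text(html)
    return [lit for lit in literals if " ".join(lit.split()).lower() not in text]


# Offline example with hypothetical content. Against a live server you would
# obtain the two representations with, for instance:
#   rdf = fetch(uri, "text/turtle")
#   html = fetch(uri, "text/html")
html = "<html><body><h1>Isis Tavern</h1><p>A pub beside the  Thames.</p></body></html>"
literals = ["Isis Tavern", "Buy cheap watches"]
missing = literals_missing_from_html(literals, html)
```

Here `missing` contains only the literal that the HTML page never shows, which is the kind of mismatch between the two representations that a filter might look for. The check runs in one direction only, so text shown to humans but absent from the RDF would need a separate comparison.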
In addition to these “spam vectors” identified by Davis, you should bear in mind
the issue with generating your own incoming links, a technique we suggested in