Using Linked Data - Linked Data: A Geographic Perspective

8.9.2 Semantic Spam

Semantic spam is the term given to the misuse of Linked Data, or misrepresentation

of information within Linked Data, to direct a semantic search engine or Semantic

Web application to a spammer's data or Web site. When creating your own Linked

Data and linking to other datasets, it is important to be aware of the tricks that could

be used to insert false data. This will help you avoid using such techniques, however

innocently, as semantic search engines will no doubt soon begin to detect and filter

out datasets that employ these methods. This problem is in its infancy, but Ian Davis [22]

has identified a number of semantic spam techniques. These include false labeling,

identity assumption, false provenance, and manipulation of content negotiation.

In false labeling, well-regarded subject URIs are assigned an rdfs:label with

spam content. Since rdfs:labels are often used for human-readable display when

denoting an RDF resource, the spammer's message might well appear prominently in

the Linked Data application in place of, say, Tim Berners-Lee's URI. Spam content

can also be inserted as the object of triples involving other predicates commonly used

to hold human-readable content, such as isPrimaryTopicOf or rdfs:seeAlso,

or it is even possible for the predicate itself to be given a spam rdfs:label. Another

direction of spam attack is identity assumption; owl:sameAs is used to misconnect

a popular resource to a false resource that promotes the spam message. Since

owl:sameAs is so widespread, many Linked Data applications use it for aggregating

all triples about the subject together, so when querying for all data about

dbpedia:London, say, you could find that spam triples are returned as well.
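
The way owl:sameAs aggregation can pull in spam, and one possible defence, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the quads, the gazetteer: and spam: resources, the source names, and the aggregate function are all invented here, and restricting aggregation to an allowlist of trusted sources is one heuristic among several, not a method described by Davis.

```python
from collections import deque

SAME_AS = "owl:sameAs"

# Each statement is (subject, predicate, object, source), where `source`
# names the dataset the triple was loaded from. All data is hypothetical.
QUADS = [
    ("dbpedia:London", "rdfs:label", "London", "dbpedia.org"),
    ("dbpedia:London", SAME_AS, "gazetteer:London", "dbpedia.org"),
    ("gazetteer:London", "rdfs:label", "London", "gazetteer.example"),
    ("spam:London", SAME_AS, "dbpedia:London", "spam.example"),
    ("spam:London", "rdfs:label", "Cheap pills here", "spam.example"),
]


def aggregate(start, quads, trusted_sources=None):
    """Collect triples about `start` and every resource linked by owl:sameAs.

    If `trusted_sources` is given, quads from any other source are ignored,
    so an untrusted dataset cannot attach itself to a popular resource.
    """
    if trusted_sources is not None:
        quads = [q for q in quads if q[3] in trusted_sources]

    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for s, p, o, _ in quads:
            if p != SAME_AS:
                continue
            # owl:sameAs is symmetric, so follow it in both directions.
            other = o if s == node else s if o == node else None
            if other is not None and other not in seen:
                seen.add(other)
                queue.append(other)

    # Return the descriptive triples for every resource in the sameAs cluster.
    return [(s, p, o) for s, p, o, _ in quads if s in seen and p != SAME_AS]


naive = aggregate("dbpedia:London", QUADS)
filtered = aggregate(
    "dbpedia:London", QUADS,
    trusted_sources={"dbpedia.org", "gazetteer.example"},
)
```

With no allowlist, naive contains the spam label because the spammer's own owl:sameAs statement is followed; filtered keeps only the two labels from the trusted sources.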

Another opportunity for spammers is false provenance; they attribute their

message to a well-known and trustworthy person, for example, by saying

<http://mereamaps.gov.me/PR/666> a bibo:Quote ;
    bibo:content 'I always drink at the Isis Tavern' ;
    dc:creator 'Tim Berners-Lee' .

This quotation could be displayed by a Linked Data application, along with its attribution,

thus misleading consumers. A twist on this misattribution is to state the URI

of a trusted individual or organization as the object of the triple instead of merely the

text “Tim Berners-Lee.”
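
One rough way a consumer might treat such attributions with caution is to check whether the claim is served from the claimed creator's own host. The Python sketch below is an assumption made for illustration, not a technique from Davis: the function names are invented, and matching hostnames is a crude heuristic that will reject many genuine third-party attributions.

```python
from urllib.parse import urlparse


def host(uri):
    """Return the lowercased hostname of a URI, or None if it has none."""
    return urlparse(uri).hostname


def attribution_is_corroborated(source_url, creator):
    """True only if the claim is served from the creator's own host.

    `creator` may be a URI or a plain literal such as 'Tim Berners-Lee'.
    A literal has no hostname, so it can never be corroborated this way.
    """
    creator_host = host(creator)
    return creator_host is not None and host(source_url) == creator_host


# The quotation from the example above, attributed first by literal and
# then by URI; neither is served from the creator's own host.
by_literal = attribution_is_corroborated(
    "http://mereamaps.gov.me/PR/666", "Tim Berners-Lee")
by_uri = attribution_is_corroborated(
    "http://mereamaps.gov.me/PR/666",
    "http://www.w3.org/People/Berners-Lee/card#i")

# A statement published in the creator's own document passes the check.
self_asserted = attribution_is_corroborated(
    "http://www.w3.org/People/Berners-Lee/card",
    "http://www.w3.org/People/Berners-Lee/card#i")
```

Both forms of the spam attribution fail the check, while a statement in the creator's own document passes. An application could use the result to decide whether to display the attribution at all, or to display it marked as unverified.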

Another trick outlined by Ian Davis is when useful Linked Data is supplied to

the software agent, but spam messages are provided to humans, by manipulation

of content negotiation processes. A Linked Data application will make one kind of

HTTP request, whereas a Web browser, which aims to supply human-readable information,

will send a different HTTP request, so the spam server can send different content

to the two. This problem then means that it is particularly important for you as a

well-regarded, nonspamming publisher to supply the same semantics in your Linked

Data as in your human-readable content, as this is something that antispam filters will

soon find a way to test for. (This recommendation is similar to the best practice guideline

in the traditional Web of not hiding any data or links from the user.)
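
The kind of consistency test an antispam filter might apply can be sketched as follows. Everything here is hypothetical: the two toy "servers" are plain Python functions standing in for real HTTP endpoints, the example.org URI is invented, and comparing rdfs:label values against the HTML text is a simple assumed heuristic rather than a known filter implementation.

```python
import re

RDF_TYPES = ("text/turtle", "application/rdf+xml")
TURTLE = '<http://example.org/pub/1> rdfs:label "Isis Tavern" .'


def honest_server(accept_header):
    """Toy server: both representations describe the same thing."""
    if any(t in accept_header for t in RDF_TYPES):
        return TURTLE
    return "<html><body><h1>Isis Tavern</h1></body></html>"


def cloaking_server(accept_header):
    """Toy spam server: useful RDF for agents, spam for human readers."""
    if any(t in accept_header for t in RDF_TYPES):
        return TURTLE
    return "<html><body><h1>Buy cheap pills now</h1></body></html>"


def labels_in_turtle(body):
    """Extract rdfs:label string values with a deliberately naive regex."""
    return re.findall(r'rdfs:label\s+"([^"]*)"', body)


def representations_agree(fetch):
    """Request both representations and check the labels appear in the HTML.

    `fetch` is any callable that takes an Accept header value and returns
    the response body as text.
    """
    labels = labels_in_turtle(fetch("text/turtle"))
    html = fetch("text/html")
    return bool(labels) and all(label in html for label in labels)


honest_ok = representations_agree(honest_server)
cloaked_ok = representations_agree(cloaking_server)
```

The honest server passes because its label also appears in the HTML page, whereas the cloaking server fails. A real check would need a proper RDF parser and a more tolerant comparison, but the principle of requesting both representations and comparing them is the same.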

In addition to these “spam vectors” identified by Davis, you should bear in mind

the issue with generating your own incoming links, a technique we suggested in
