DATA (Social Science)

The word data (singular, datum) is originally Latin for "things given or granted." Because of its humble and generic meaning, the term enjoys considerable latitude both in technical and common usage, for almost anything can be referred to as a "thing given or granted" (Cherry 1978). With reasonable approximation, four principal interpretations may be identified in the literature. The first three capture part of the nature of the concept and are discussed in the next section. The fourth is the most fundamental and satisfactory, so it is discussed separately in the subsequent section. Further clarifications about the nature of data are also introduced. A reminder about the social, legal, and ethical issues raised by the use of data concludes this entry.

THREE INTERPRETATIONS OF THE CONCEPT OF DATA

According to the epistemic (i.e., knowledge-oriented) interpretation, data are collections of facts. In this sense, data provide the basis for further reasoning, as when one speaks of data as the givens of a mathematical problem, or represent the basic assumptions or empirical evidence on which further evaluations can be based, as in a legal context. This interpretation has two main limits. First, it is overly restrictive in that it fails to explain, for example, processes such as data compression (any encoding of data that reduces the number of data units needed to represent some unencoded data; see Sayood [2006] for an introduction) or data cryptography (any procedure used to transform available data into data accessible only by their intended recipient; see Singh [1999] for an introduction), which may apply to facts only in a loosely metaphorical sense. Second, it trades one difficult concept (data) for an equally difficult one (facts), when actually facts are more easily understood as the outcome of data processing. For example, census data may establish a number of facts about the composition of a population.
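The first limit can be illustrated with a short Python sketch. It uses the standard-library zlib module, chosen here only for illustration since the entry names no particular method, to compress a sequence of bytes that states no fact at all. The byte string and the variable names are assumptions made for the example.

```python
import zlib

# A repetitive byte sequence that asserts nothing about the world.
raw = b"ab" * 1000

# Compression reduces the number of data units needed to represent the input.
compressed = zlib.compress(raw)

# Decompression recovers the unencoded data exactly.
restored = zlib.decompress(compressed)

print(len(raw), len(compressed))
print(restored == raw)
```

The procedure operates on the byte sequence as such, whether or not the bytes happen to express facts.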


According to the informational interpretation, data are information. In this sense, for example, personal data are equivalent to information about the corresponding individual. This interpretation is useful in order to make sense of expressions such as data mining (information gathering; see Han and Kamber [2001] for an introduction) or data warehouse (information repository). However, two major shortcomings show its partial inadequacy. First, although it is important to stress how information depends on data, it is common to understand the former in terms of the latter, not vice versa: information is meaningful and truthful data, for example, "paper is inflammable" (Floridi 2003). So one is left with the problem of understanding what data are in themselves. Second, not all data are informational in the ordinary sense in which information is equivalent to some content (e.g., a railway timetable) about a referent (the schedule of trains from Oxford to London). A music CD may contain hundreds of megabytes of data, but no information about anything (Floridi 2005).

According to the computational interpretation, data are collections (sets, strings, classes, clusters, etc.) of binary elements (digits, symbols, electrical signals, magnetic patterns, etc.) processed and transmitted electronically by technologies such as computers and cellular phones. This interpretation has several advantages. It explains why pictures, music files, or videos are also constituted by data. It is profitably related both to the informational and to the epistemic interpretation, since a binary format is increasingly often the only one in which experimental observations or raw facts may be available and further manipulated (collected, stored, processed, etc.) to generate information, for example, in the course of scientific investigations (von Baeyer 2003). Finally, it highlights the malleable nature of data and hence the possibility of their automatic processing (Pierce 1980). The main limit of this interpretation lies in the confusion between data and the format in which data may be encoded. Data need not be discrete (digital); data can also be analog (continuous). A CD and a vinyl record both contain music data. Binary digits are only the most recent and common incarnation of data.

Given these interpretations, it seems wise to exercise some flexibility and tolerance when using the concept of data in different contexts. On the other hand, it is interesting to note that the aforementioned interpretations all presuppose a more fundamental definition of data, to which we now turn.

THE DIAPHORIC INTERPRETATION OF DATA

A good way to uncover the most fundamental nature of data is by trying to understand what it means to erase, damage, or lose data. Imagine the page of a book encrypted or written in a language unknown to us. We have all the data, but we do not know the meaning, hence we have no information, facts, or evidence. Suppose the data are continuous pictograms. We still have all the data, but no binary bits. Let us now erase half of the pictograms. We may say that we have halved the data as well. If we continue in this process, when we are left with only one pictogram we might be tempted to say that data require, or may be identical with, some sort of representation. But now let us erase that last pictogram too. We are left with a white page, yet not without data. For the presence of a white page is still a datum, as long as there is a difference between the white page and the page on which something is written. Compare this to the common phenomenon of "silent assent": silence, or the lack of perceivable data, is as much a datum as the presence of some noise, exactly like the zeros of a binary system. We shall return to this point presently, but at the moment it is sufficient to grasp that a genuine, complete erasure of all data can be achieved only by the elimination of all possible differences. This clarifies why a datum is ultimately reducible to a lack of uniformity.

More formally, according to the diaphoric interpretation (diaphora is the Greek word for "difference"), the general definition of a datum is:

(D) datum = x being distinct from y, where x and y are two uninterpreted variables and the domain is left open to further interpretation.
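Definition (D) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the representation of a page as a list of cell colors are hypothetical, and Python's inequality test stands in for the uninterpreted relation of being distinct.

```python
from itertools import combinations

def is_datum(x, y):
    """(D): x and y constitute a datum exactly when they are distinct."""
    return x != y

def count_differences(states):
    """Count the unordered pairs of distinct states in a collection."""
    return sum(1 for x, y in combinations(states, 2) if is_datum(x, y))

# Hypothetical pages, each modeled as four cells.
blank_page = ["white", "white", "white", "white"]
page_with_dot = ["white", "white", "black", "white"]

print(count_differences(blank_page))         # no internal lack of uniformity
print(count_differences(page_with_dot))      # the dot differs from each white cell
print(is_datum(blank_page, page_with_dot))   # the two pages differ from each other
```

A uniform page contains no internal differences, yet it still differs from the written page, which matches the observation that a white page remains a datum as long as that contrast exists. The relation is also symmetric: is_datum(x, y) and is_datum(y, x) always agree.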

This definition can be applied at three levels:

(1) Data as diaphora de re, that is, as a lack of uniformity in the world (Seife 2006). There is no specific name for such "data in the wild." A possible suggestion is to refer to such data as dedomena ("data" in Greek; note that the word data comes from the Latin translation of a work by Euclid entitled Dedomena). Dedomena are not to be confused with environmental data. Dedomena are pure data or proto-epistemic data, that is, data before they are interpreted. Dedomena can be posited as an external anchor of information, for dedomena are never accessed or elaborated independently of a level of abstraction. They can be reconstructed as requirements for any further analysis: they are not experienced, but their presence is empirically inferred from (and required by) experience. Of course, no example can be provided, but data as dedomena are whatever lack of uniformity in the world is the source of what appears to information systems like us as data, for example, a red light against a dark background.

(2) Data as diaphora de signo, that is, as a lack of uniformity between (the perception of) at least two physical states of a system, such as a higher or lower charge in a battery, a variable electrical signal in a telephone conversation, or the dot and the dash in the Morse alphabet.

(3) Data as diaphora de dicto, that is, as a lack of uniformity between two symbols of a code—for example, the letters A and B in the Latin alphabet.

Depending on one’s interpretation, dedomena in (1) may be either identical with, or what make possible, signals in (2), and signals in (2) are what make possible the coding of symbols in (3).

The dependence of information on the occurrence of well-structured data, and of data on the occurrence of differences (dedomena) variously implementable physically, explains why information can so easily be decoupled from its support. The actual format, medium, and language in which data (and hence information) are encoded are often irrelevant and hence disregardable. In particular, the same data may be analog or digital, printed on paper or viewed on a screen, in English or in some other language, expressed in words or pictures, or quantitative or qualitative.
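The support-independence described above can be sketched in Python. The example is a minimal illustration, not a full codec: the two-letter Morse table and all function names are assumptions, and the message "SOS" is chosen only because it needs just two Morse symbols.

```python
# Partial Morse table covering only the letters used in this example.
MORSE = {"S": "...", "O": "---"}
MORSE_INVERSE = {code: letter for letter, code in MORSE.items()}

def to_morse(text):
    """Encode text as dots and dashes, one group per letter."""
    return " ".join(MORSE[ch] for ch in text)

def from_morse(signal):
    """Decode space-separated Morse groups back into text."""
    return "".join(MORSE_INVERSE[group] for group in signal.split(" "))

def to_bits(text):
    """Encode text as 8-bit binary groups, one group per character."""
    return " ".join(format(ord(ch), "08b") for ch in text)

def from_bits(bits):
    """Decode space-separated 8-bit groups back into text."""
    return "".join(chr(int(group, 2)) for group in bits.split(" "))

message = "SOS"
as_morse = to_morse(message)
as_bits = to_bits(message)

print(as_morse)
print(as_bits)
print(from_morse(as_morse) == from_bits(as_bits) == message)
```

The two encodings share no symbols, yet both decode to the same message, which is the sense in which the format is often irrelevant to the data it carries.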

Interpretations of the support-independence of data can vary radically, for the definition (D) above leaves underdetermined:

• the classification of the relata (taxonomic neutrality);

• the logical type to which the relata belong (typological neutrality);

• the dependence of their semantics on a producer (genetic neutrality).

We shall now look at each form of neutrality in turn.

Taxonomic Neutrality

A datum is usually classified as the entity exhibiting the anomaly, often because the latter is perceptually more conspicuous or less redundant than the background conditions. However, the relation of inequality is binary and symmetric. A white sheet of paper is not just the necessary background condition for the occurrence of a black dot as a datum; it is a constitutive part of the (black-dot-on-white-sheet) datum itself, together with the fundamental relation of inequality that couples it with the dot. Nothing is a datum in itself. Rather, being a datum is an external property. This view is summarized by the principle of taxonomic neutrality (TaN): a datum is a relational entity.

The slogan is "data are relata," but the definition of data as differences is neutral with respect to the identification of data with specific relata. In our example, one may refrain from identifying either the black dot or the white sheet as the datum.

Typological Neutrality

Five classifications of different types of data as relata are common. They are not mutually exclusive, and the same data may fit different classifications depending on the circumstances, the type of analysis conducted, and the level of abstraction adopted.

Primary data are the principal data stored in, for example, a database. Such data may be a simple array of numbers. They are the data an information-management system is generally designed to convey (in the form of information) to the end user. Normally, when speaking of data one implicitly assumes that primary data are what is in question. So, by default, the flashing red light of the low-battery indicator is assumed to be an instance of primary data conveying primary information.

Secondary data are the converse of primary data, constituted by their absence. Clearly, silence may be very informative. This is a peculiarity of data: their absence may also be informative.

Metadata are indications about the nature of some other (usually primary) data. Metadata describe properties such as location, format, updating, availability, usage restrictions, and so forth. Correspondingly, metainformation is information about the nature of information. The statement "‘Rome is the capital of Italy’ is encoded in English" is a simple example.

Operational data are data regarding the operations of the entire data system and the system’s performance. Correspondingly, operational information is information about the dynamics of an information system. Suppose a car has a yellow light that, when flashing, indicates that the car-checking system is malfunctioning. The fact that the light is on may indicate that the low-battery indicator is not working properly, thus undermining the hypothesis that the battery is flat.

Derivative data can be extracted whenever data are used as indirect sources in a search for patterns, clues, or inferential evidence about something other than that directly addressed by the data themselves, as in comparative and quantitative analyses. Derivative data are used, for example, when one infers a person’s whereabouts at a given time from her credit card data and the purchase of gasoline at a certain gas station.

Let us now return to the question of whether or not there can be dataless information. The definition of data given above (D) does not specify which types of relata are in question, only that data are a matter of a relation of difference. This typological neutrality is justified by the fact that, when the apparent absence of data is not reducible to the occurrence of negative primary data, what becomes available and qualifies as information is some further non-primary information x about y constituted by some non-primary data z. For example, if a database query provides an answer, it will provide at least a negative answer, such as "no documents found." This datum conveys primary negative information. However, if the database provides no answer, either it fails to provide any data at all, in which case no specific information is available (so the rule "no information without data" still applies), or it can provide some data to establish, for example, that it is running in a loop. Likewise, silence, this time as a reply to a question, could represent negative primary information such as implicit assent or denial, or it could carry non-primary information, for example, about whether the person heard the question or about the level of noise in the room.
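The database example can be sketched as follows. This is a toy model under stated assumptions, not a real database interface: the Kind enumeration, the run_query function, and its state parameter are all hypothetical names introduced to separate the cases discussed above.

```python
from enum import Enum
from typing import List, Optional, Tuple

class Kind(Enum):
    PRIMARY = "primary"
    NEGATIVE_PRIMARY = "negative primary"
    NON_PRIMARY = "non-primary"

def run_query(documents: List[str], term: str,
              state: str = "ok") -> Optional[Tuple[Kind, str]]:
    """Return (kind, message), or None when no data at all are provided."""
    if state == "silent":
        # No data whatsoever: "no information without data" still applies.
        return None
    if state == "looping":
        # Non-primary data about the system itself rather than the documents.
        return (Kind.NON_PRIMARY, "query is running in a loop")
    matches = [doc for doc in documents if term in doc]
    if matches:
        return (Kind.PRIMARY, f"{len(matches)} document(s) found")
    # An explicit negative answer is still a datum.
    return (Kind.NEGATIVE_PRIMARY, "no documents found")

docs = ["census 2001", "census 2011", "railway timetable"]

print(run_query(docs, "census"))
print(run_query(docs, "hieroglyphics"))
print(run_query(docs, "census", state="looping"))
print(run_query(docs, "census", state="silent"))
```

Only the last call yields no information, because it is the only one that returns no data of any kind.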

Genetic Neutrality

Finally, let us consider the semantic nature of the data. How data can come to have an assigned meaning and function in a semiotic system in the first place is one of the hardest problems in semantics. Luckily, the question is not how but whether data constituting information as semantic content can be meaningful independently of an informee. The principle of genetic neutrality (GeN) states: data can have a semantics independently of any informee.

Before the discovery of the Rosetta stone in 1799, ancient Egyptian hieroglyphics were already regarded as information, even if their semantics was beyond the comprehension of any interpreter. The discovery of an interface between Greek and Egyptian did not affect the semantics of the hieroglyphics, but only its accessibility. This is the weak sense in which meaningful data may be embedded in information-carriers informee-independently. GeN supports the possibility of information without an informed subject, and it is to be distinguished from the stronger, realist thesis (supported, for example, by Dretske [1981]), according to which data could also have their own semantics independently of an intelligent producer/informer.

CONCLUSION

Much social research involves the study of logical relationships between sets of attributes (variables). Some of these variables are dependent; they represent the facts that a theory seeks to explain. Other variables are independent; they are the data on which the theory is developed. Thus, data are treated as factual elements that provide the foundation for any further theorizing. It follows that data observation, collection, and analysis are fundamental to the elaboration of a theory, and computational social science (high-performance computing, very large data storage systems, and software for fast and efficient data collection and analysis) has become an indispensable tool for the social scientist.

This poses several challenges. Some are technical. For example, data may result from a variety of disparate sources (especially when collected through the Internet) whose reliability needs to be checked; data may be obtainable only through sophisticated processes of data mining and analysis whose accurate functioning needs to be under constant control; or the scale, complexity, and heterogeneous nature of the dataset may pose daunting difficulties, computationally, conceptually, and financially. Other challenges are intellectual, ethical, political, or indeed social. Some of the main issues that determine the initial possibility and final value of social research include quality control (e.g., timely, updated, and reliable data); availability (e.g., which and whose data are archived, and using what tools); accessibility (e.g., privacy issues and old codification systems or expensive fees that can make data practically inaccessible); centralization (e.g., economy of scale, potential synergies, the increased value of large databases); and political control (e.g., who exercises what power over which available datasets and their dissemination).

Data are the sap of any information system and any social research that relies on it. Their corruption, wanton destruction, unjustified concealment, or illegal or unethical use may easily undermine the basic processes on which not only scientific research but also the life of individuals and their complex societies depend (Brown and Duguid 2000). In light of the importance of data, their entire life cycle—from collection or generation through storage and manipulation to usage and possible erasure—is often protected, at different stages, by legal systems in various ways and in many different contexts. Examples include copyright and ownership legislation, patent systems, privacy-protection laws, fair-use agreements, regulations about the availability and accessibility of sensitive data, and so forth. The more societies develop into databased societies, the more concerned and careful they need to become about their very foundation. Unsurprisingly, since the 1980s, a new area of applied ethics, known as information ethics (Floridi 1999), has begun to address the challenging ethical issues raised by the new databased environment in which advanced societies grow.
