Sunday, August 21

Ignorance, Knowledge, and the Three Strikes Rule

I asked regular reader and contributor "Liz" to consider doing a guest essay on a topic of her choice. She responded with an essay that analyzes reporting on the mysterious 'pig illness' and teaches the difference between data and information, and between qualitative and quantitative data.

Dare I use the expression "hog heaven" to describe my delight at the topic she chose? I hope that news consumers, journalists and intel analysts take notes while reading. They'll be rewarded countless times over for their brain sweat.

Being clear on the differences Liz points out is a kind of magic lamp. The lamp illuminates the murkiest news reports, shines through the thickest smoke blown by officials, and spotlights the exit in a hall of mirrors built from half-truths.

This said, there is a flaw in Liz's conclusion. In a world where information has become a powerful weapon, and during the course of a very hot war, we can't always eschew speculation in the face of unreliable or very incomplete data, any more than we can always delay strong action in the face of little knowledge.

What we can do, and what I've demonstrated in a series of recent essays, is to seek data oriented toward our goals rather than toward the acquisition of knowledge.

Thus, while we don't know much more about the mystery illness than when the story first broke, we now know a great deal about how China's government has responded to the outbreak and how they've chosen to communicate about it. From this we can conclude, at the least:

This makes the third time in two years that China's central authority has lied themselves blue in the face to the American government about a lethal outbreak of illness in Mainland China.

This makes the third time in two years that US congressionals have not called for a formal US protest to China because of Beijing's refusal to share critical health data with the CDC.

This makes the third time in two years that the US Department of State has not issued a ban on travel to China in response to Beijing's refusal to share medical data with the CDC about a lethal illness outbreak in China.

All this suggests a clear action path: at the voting booth, in letters to editors and congressionals, in calls to companies that do business in China, and while shopping for vacation spots and imported items.

For there must be some means to get across that we're not sending our soldiers to battle against suicide bombers only to see our Congress, foreign office and leaders of industry encourage suicidal behavior.

By all means necessary, Beijing and Washington must be brought to understand that there will not be a fourth time.

With that said, I give the floor to Liz.

Data != Information
The above statement is actually a misuse (or perhaps abuse would be the better term) of a numeric operator in the Perl scripting language, but it’s a handy way to make an important point: data does not equal information.*

What difference does it make? Consider the many recent posts from Pundita and others concerning the mysterious deaths in China. Were the deaths caused by a disease or a combination of diseases; a newly evolved disease or an old one? Or by industrial pollution or toxic wastes? Or some combination of the above?

Pundita has reminded us, early and often, not to assume too much from the limited reports we have, especially since later reports, like those from any totalitarian state, are manipulated to serve the purpose of the manipulators (who may or may not be those in power at the moment).

What we have is not information, it’s data, and it’s pretty darned sketchy data, at that.

The difference matters, because it profoundly affects what we ought to do. Data, to use the readily available Merriam-Webster On-line dictionary is:

1 : factual information (as measurements or statistics) used as a basis for reasoning, discussion, or calculation: "the data is plentiful and easily available" -- H. A. Gleason, Jr.; "comprehensive data on economic growth have been published" -- N. H. Jacoby.

2 : information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful

3 : information in numerical form that can be digitally transmitted or processed

The term information, on the other hand, is not as crisply defined; it has several meanings, many related to mathematical values in measuring reliability of content. However, we do have one relevant definition:

2 a (1) : knowledge obtained from investigation, study, or instruction.

Information is data that has been analyzed, processed, and/or evaluated for the value of its content. Take ten students’ heights and weights -- data -- and average them, or plot them on a graph (weight vs. height); collect more measurements, from more students, and separate them by gender. Is gender more important than height in predicting weight? Does their age matter?

You can answer those questions when you have sufficient observations, because you can legitimately mathematically manipulate the data -- meaning can be derived from it, turning it into information.
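Here is a minimal sketch of that process in Python, with invented heights and weights standing in for the students' measurements. The averaging and the least-squares slope are the step where raw data becomes information: a derived, meaningful quantity.

```python
# Hypothetical heights (cm) and weights (kg) for ten students.
# The numbers are invented purely for illustration.
heights = [160, 165, 170, 172, 168, 175, 180, 158, 163, 177]
weights = [55, 60, 65, 68, 62, 72, 80, 52, 58, 75]

n = len(heights)
mean_height = sum(heights) / n
mean_weight = sum(weights) / n

# A simple least-squares slope estimates how weight varies with height --
# a derived quantity, i.e. information rather than raw data.
cov = sum((h - mean_height) * (w - mean_weight)
          for h, w in zip(heights, weights)) / n
var = sum((h - mean_height) ** 2 for h in heights) / n
slope = cov / var

print(f"mean height: {mean_height:.1f} cm")
print(f"mean weight: {mean_weight:.1f} kg")
print(f"weight gained per extra cm of height: {slope:.2f} kg")
```

With more observations, the same machinery lets you split by gender or age and ask which factor predicts weight best.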

But the process is easy to misuse, just as I misused the Perl operator above, when we’re intellectually a little bit lazy. Here’s an example, all too common, from my early statistical analysis training:

We have all taken surveys which asked us to rate something on a scale, where "value_1" means Disagree Strongly, "value_2" means Disagree, "value_3" means Neither Agree nor Disagree, "value_4" means Agree, and "value_5" means Agree Strongly.**

There’s nothing inherently wrong with the construction of such a survey, though the wording of the statements you’re being asked about has a fundamental effect on how you’ll answer. The problem arises when we attempt to turn the survey data into information.

Suppose I used 1 for "value_1", 2 for "value_2", etc. That does not mean the responses have any numerical value. They cannot legitimately be summed, averaged, or have any other mathematical operation performed on them.

If that seems a little difficult to accept (and it is, initially, for almost everyone) think of it this way: we have a spectrum of colors, generally considered as running red-orange-yellow-green-blue-violet in a longer-to-shorter wavelength progression. I could just as easily have assigned red as "value_1", orange as "value_2", etc., through blue as "value_5" (and I could have used six values had I wanted, just to map nicely to the color spectrum).

How could I sum the responses, or average them, or take a standard deviation of the responses from the mean answer to a particular statement? Could I really say the average response to statement 1 was teal, and the average response to statement 2 was indigo? Where does burnt sienna lie on the scale? Of course, people mathematically manipulate these kinds of results all the time, but that doesn’t make it right.
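A small Python sketch, using invented responses, shows why the "average" of Likert codes misleads and what the legitimate alternative looks like:

```python
from collections import Counter

# Hypothetical Likert responses to one survey statement, coded 1-5.
# Note the polarization: most respondents chose the extremes.
responses = [1, 1, 5, 5, 5, 1, 1, 3]

# Tempting but illegitimate: treat the ordinal codes as numbers.
fake_mean = sum(responses) / len(responses)
print(f"'average' response: {fake_mean:.2f}")
# The result lands near the middle category, though almost no one
# actually chose the middle option -- the 'average opinion' is fiction.

# Legitimate: count how many responses fall into each category.
counts = Counter(responses)
for code in sorted(counts):
    print(f"category {code}: {counts[code]} responses")
```

The category counts reveal the polarized split that the numeric "average" papers over.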

This kind of data is called qualitative, as opposed to quantitative; we know certain characteristics about the observations, but those known characteristics have no mathematical meaning.

Qualitative data may be represented as existing in a given category; you may count the observations in a given category and compare that to the number of observations in another category, but in the end, the data itself is nothing more than a “yes” answer to the question “does this fit in bucket A?”

The doctor asks if you have a headache; the answer is yes or no -- it’s a qualitative answer. Your body temperature is quantitative -- it’s measurable on a standard scale, with known reference points (which differ slightly depending on where the observation is taken: orally, in the armpit, or rectally, for instance).

Because temperature is quantitative data, I can take measurements -- data -- from many patients on admission and report a mean fever, with a standard deviation (a measure of how widely the observations vary from the mean).

With headache, I can only aggregate the observations into yes/no categories, and report how many or what percentage of patients fell into a category. I can’t report how strong or debilitating their headaches were.
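The contrast can be sketched in a few lines of Python, using hypothetical patient records (all figures invented):

```python
import statistics

# Hypothetical admission records: each patient has a yes/no headache
# flag (qualitative) and a body temperature in degrees F (quantitative).
patients = [
    {"headache": True,  "temp": 101.2},
    {"headache": False, "temp": 99.8},
    {"headache": True,  "temp": 102.5},
    {"headache": True,  "temp": 100.9},
    {"headache": False, "temp": 98.6},
]

# Qualitative data: the best we can do is count category membership.
with_headache = sum(p["headache"] for p in patients)
pct = 100 * with_headache / len(patients)

# Quantitative data: temperatures sit on a standard scale, so a mean
# and a standard deviation are legitimate, meaningful summaries.
temps = [p["temp"] for p in patients]
mean_temp = statistics.mean(temps)
sd_temp = statistics.stdev(temps)

print(f"{with_headache}/{len(patients)} patients ({pct:.0f}%) report headache")
print(f"mean temperature on admission: {mean_temp:.1f} F (sd {sd_temp:.2f})")
```

Nothing stops you from coding headache as 1/0 and computing its "mean," but that number is just the category percentage wearing a disguise; it carries no intensity information.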

The physical and biological sciences (and engineering) emphasize quantitative data, while the social sciences are forced to use qualitative data, since we frown on using humans as experimental subjects.

We have no quantitative data on the mystery deaths in China. We don’t have any source of measurable data, such as blood samples, because the Chinese government has chosen not to provide them to outsiders. We don’t even know how many deaths there have been, much less the mortality rate among all those who experienced the illness.

Nor do we have any means of measuring the effect of intensity of exposure, or the existence of mitigating or exacerbating factors such as age, nutrition, or gum disease. (Open wounds in the mouth can be an entry vector for disease).

In short, we don’t know much of anything. We have some qualitative data: persons were reported to have experienced these symptoms or behaviors, in those conditions. But quantitative data (e.g., How many people infected? How long after exposure did they become ill? What was the extent of exposure?) as reported in the media is sketchy, missing or contradictory.

Nor has data been validated by proper scientific methods (and proper scientific methods may not arrive at the truth quickly every time, but they have a better track record than nonscientific methods).

If we have no quantitative data with which to use the usual methods of medicine (epidemiology, pathology, toxicology, as well as diagnostic tools), can we apply the methods used by the social sciences? After all, we have qualitative data, and that’s what the social sciences have to use.

The answer is essentially “no.” We have a few incidents, but we don’t even know, with any certainty, how many observations we have. We actually know relatively little about the incidents, and not enough to determine the categories into which to place the observations.

For instance, if we reliably knew where each incident occurred, we could create a category for proximity to factories known to use certain chemicals, or for proximity to animal rendering plants.

If we knew how many, we could estimate morbidity (how many of the local at-risk population got sick) as well as mortality. If we knew more about the victims in every case, such as their occupations, we could categorize by exposure to possible sources.
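The arithmetic itself is trivial, which underlines how much the missing inputs matter. A sketch, with every number invented (none of these figures are known for the Chinese outbreak, which is exactly the point):

```python
# Hypothetical figures, purely for illustration.
population_at_risk = 10_000   # people plausibly exposed in the locality
cases = 200                   # people who became ill
deaths = 30                   # people who died of the illness

morbidity_rate = cases / population_at_risk   # how many got sick
mortality_rate = deaths / cases               # fatality among the sick

print(f"morbidity: {100 * morbidity_rate:.1f}% of the at-risk population")
print(f"mortality: {100 * mortality_rate:.1f}% of those who fell ill")
```

Without reliable counts for any of the three inputs, neither rate can be computed, and any number offered in their place is speculation.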

But we have very little information of that kind. We know, in fact, very little. And while a little imagination can go a long way in the creative arts, it’s dangerous to make foreign policy based on imagination. It’s dangerous to do nothing based on imagination, too.

So we shouldn’t ban travel to China (foreign policy) and we shouldn’t ignore the problem (do nothing). We should be working to get more and better quality data -- data we can legitimately turn into information.

One more thing: Until we have good data, in sufficient quantity to create information, we should refrain from speculation. There’s more than enough of that going around, already.

* Since I’m using strings, I should use “ne” instead of “!=”, which is properly used for numeric rather than string (written language) comparisons -- but it’s more attention-getting this way!

** This kind of answer system is known as a Likert Scale; it’s often misused by amateurs when conducting surveys -- a misuse often coupled with statements constructed to lead the response in a desired direction.
