Each individual is subject to vanity

via alea:

The Leviathan model: Absolute dominance, generalised distrust and other patterns emerging from combining vanity with opinion propagation [PDF]:

“We propose an opinion dynamics model that combines processes of vanity and opinion propagation. The interactions take place between randomly chosen pairs. During an interaction, the agents propagate their opinions about themselves and about other people they know. Moreover, each individual is subject to vanity: if her interlocutor seems to value her highly, then she increases her opinion about this interlocutor. On the contrary, she tends to decrease her opinion about those who seem to undervalue her… opinion propagation is more efficient when coming from highly valued individuals…[In one] pattern, absolute dominance of one agent alternates with a state of generalised distrust, where all agents have a very low opinion of all the others (including themselves). We provide some explanations of the mechanisms behind these emergent behaviors…”

Emphasis and (egregious, unjustifiable) selective excerpt mine.
*The new and improved version of the article, following referee comments, is now available [1203.3065v2], abstract.

This article was published in The New York Times in July 2011. It is worth a second glance, especially for those concerned about research reproducibility. This was the central point:

“A Duke University program to tailor cancer treatments to certain patterns of genes has ended in disaster and lawsuits.”

Genomics is complicated

The central problem was the intricacy of the analyses in the emerging field of specialized genetics. These methods use patterns from large groups of genes to improve detection and treatment of cancer. Companies have been formed, and are already selling products that claim to use genetics in this remarkable way. Regulation has been difficult because of the complexity of the process, and because each part of the research requires different areas of expertise.

Statisticians stop a bad situation from getting worse

Fortunately, for this particular treatment, some of the genetics researchers and physicians asked two bio-statisticians to check their work. The statisticians found errors. Some were due to mistakes in spreadsheet tables (accidentally staggered columns or rows). These errors were not so bad. However, there were also serious systematic errors that were inexplicable.

The bio-statisticians raised an alarm. No one listened for quite a while. They even published their findings, where they could. The Annals of Applied Statistics is not on the radar for medical science, though.

Resolution

Finally, the bio-statisticians’ findings, along with other problems, converged to halt the new genetic testing and treatment program. The original research was retracted in its entirety. Unfortunately, the treatment was already being used by people diagnosed with cancer. Some number of those patients would have lived if treated by existing methods.

The New York Times article does not have a pay wall, if accessed by the link in the title of my post. It is a good article. It even includes an embedded link to that Annals of Applied Statistics journal article, in free pre-print format via arXiv, as well as other supporting materials.

Wrongness of the predicting h-index paper

simplystatistics:

Editor’s Note: I recently posted about a paper in Nature that purported to predict the H-index. The authors contacted me to get my criticisms, then responded to those criticisms. They have requested the opportunity to respond publicly, and I think it is a totally reasonable request. Until there is a better comment generating mechanism at the journal level, this seems like as good a forum as any to discuss statistical papers. I will post an extended version of my criticisms here and give them the opportunity to respond publicly in the comments. 

The paper in question is clearly a clever idea and the kind that would get people fired up. Quantifying researchers’ output is all the rage and being able to predict this quantity in the future would obviously make a lot of evaluators happy. I think it was, in that sense, a really good idea to chase down these data, since it was clear that if they found anything at all it would be very widely covered in the scientific/popular press.

My original post was inspired by my frustration with Nature, which has a history of publishing somewhat suspect statistical papers, such as this one. I posted the prediction contest after reading another paper I consider to be a flawed statistical paper, both for statistical reasons and for scientific reasons. I originally commented on the statistics in my post. The authors, being good sports, contacted me for my criticisms. I sent them the following criticisms, which I think are sufficiently major that a statistical referee or statistical journal would have likely rejected the paper:
  1. Lack of reproducibility. The code/data are not made available either through Nature or on your website. This is a critical component of papers based on computation and has led to serious problems before. It is also easily addressable. 
  2. No training/test set. You mention cross-validation (and maybe the R^2 is the R^2 using the held out scientists?) but if you use the cross-validation step to optimize the model parameters and to estimate the error rate, you could see some major overfitting. 
  3. The R^2 values are pretty low. An R^2 of 0.67 is obviously superior to the h-index alone, but (a) there is concern about overfitting, and (b) even without overfitting, an R^2 that low could lead to substantial variance in predictions. 
  4. The prediction error is not reported in the paper (or in the online calculator). How far off could you be at 5 years, at 10? Would the results still be impressive with those errors reported?
  5. You use model selection and show only the optimal model (as described in the last paragraph of the supplementary), but no indication of the potential difficulties with this model selection is given in the text. 
  6. You use a single regression model without any time variation in the coefficients and without any potential non-linearity. Clearly when predicting several years into the future there will be variation with time and non-linearity. There is also likely heavy variance in the types of individuals/career trajectories, and outliers may be important, etc. 
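To make point 3 concrete, here is a minimal sketch (not the paper's analysis) of what an R^2 of 0.67 implies for the spread of prediction errors. It assumes roughly normal, constant-variance residuals and ignores uncertainty in the fitted coefficients. The prediction of 20 and the outcome standard deviation of 10 are made-up numbers used purely for illustration.

```python
import math


def residual_sd_fraction(r_squared: float) -> float:
    """Fraction of the outcome's standard deviation left unexplained by the model."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie in [0, 1]")
    return math.sqrt(1.0 - r_squared)


def approx_prediction_interval(prediction: float, outcome_sd: float,
                               r_squared: float, z: float = 1.96):
    """Rough interval: prediction +/- z * residual SD (assumes normal residuals)."""
    half_width = z * outcome_sd * residual_sd_fraction(r_squared)
    return prediction - half_width, prediction + half_width


# Hypothetical inputs: a predicted value of 20 and an outcome SD of 10.
fraction = residual_sd_fraction(0.67)
low, high = approx_prediction_interval(prediction=20.0, outcome_sd=10.0, r_squared=0.67)
print(f"unexplained SD fraction: {fraction:.3f}")
print(f"rough 95% interval: ({low:.1f}, {high:.1f})")
```

Under those assumptions, roughly 57% of the original spread in the outcome is still there after conditioning on the predictors, which is the sense in which an R^2 of 0.67 can leave substantial variance in individual predictions.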
They carefully responded to these criticisms and hopefully they will post their responses in the comments. My impression based on their responses is that the statistics were not as flawed as I originally thought, but that the data aren’t sufficient to form a useful prediction. 
However, I think the much bigger flaw is the basic scientific premise. The h-index has been identified as having major flaws and biases (including gender bias), and as being a generally poor summary of a scientist’s contribution. See here, the list of criticisms here, and the discussion here for starters. The authors of the Nature paper propose a highly inaccurate predictor of this deeply flawed index. While that alone is sufficient to call into question the results in the paper, the authors also make bold claims about their prediction tool: 
“Our formula is particularly useful for funding agencies, peer reviewers and hiring committees who have to deal with vast numbers of applications and can give each only a cursory examination. Statistical techniques have the advantage of returning results instantaneously and in an unbiased way.”
Suggesting that this type of prediction should be used to make important decisions on hiring, promotion, and funding is highly scientifically flawed. Coupled with the online calculator the authors handily provide (which produces no measure of uncertainty), it makes it all too easy for people to miss the real value of scientific publications: the science contained in them.

(Source: simplystatistics)

simplystatistics:

… the most recent batch of Y Combinator startups included a bunch of data-focused companies. One of these companies, StatWing, is a web-based tool for data analysis that looks like an improvement on SPSS with more plain text, more visualization, and…

SimplyStatistics conceives of a way to keep would-be data analysts honest, despite increasingly easy-to-use (and abuse) statistical software applications.

(Source: simplystatistics)

My website has the most coveted audience demographic on the internet. Maybe.

VERY IMPORTANT: There is neither explicit, nor implied criticism of Quantcast data collection or reporting anywhere in the following post! Quantcast may have shortcomings, but for my purposes here (to make a point about appropriate use of data), it serves very well.

The audience profile for my “primary” Wordpress-hosted blog, “Data Anxiety Asks Why” (just substitute my real first name for “Data Anxiety”), consists of graduate-school educated, very high income, young Asian men. (Just in case you were wondering, my blog has 100% child friendly content. No pornography, no off-color language, no political content, no anti-religious or religious polemic.)

Behold the demographics as of June 2, 2012:

Gender

Age

Household Income

Education

Ethnicity

Why aren’t advertisers beating a path to my door?

There are many possibilities. The most plausible is that my blog resides at a Wordpress dot com, rather than a Wordpress dot org domain (not to mention the absolute traffic counts, in contrast to the percentage values shown above).

The point of this post was to demonstrate why statistical data must be interpreted carefully, and in the proper context.

Disregarding everything else, the following chart would make me hesitate if I were looking to spend my advertising money.

Children in Household

“What’s the problem?”

Looks like those high-income, highly educated, young Asian men are childless too. That’s good, isn’t it? Lots of disposable income, all the better for buying expensive consumer electronics and personal computers and heat-generating GPUs and maybe a new car and nice clothes and and and and…

Household data

Notice how all the demographics pertain specifically to the website audience, with two exceptions. Those two are Income and Children, which are both for the audience member’s household.

When I see that, it leads me to wonder whether my website audience might live at home with their parents and a younger brother or sister. (If they lived on their own, I would expect the “no children in household” number to be larger than it is, even though it is already higher than average.) If an audience member lives at home with their parents, the Household Income value might be due partially or entirely to parental income.

Of course, the parent(s) might be stay-at-home parent(s), and the child living at home might be a working child, but how likely is that?  That sounds awful, “working child”! Remember, these are 18 to 24 year old children, so there’s nothing illegal or exploitative about that. And regardless, it is very, very unlikely.

Open data is a wonderful thing!

Yet this may be better, or equally wonderful:

A Guide to Making Data Meaningful

The Making Data Meaningful guides are intended as a practical tool to help managers, statisticians and media relations officers in statistical organizations use text, tables, charts, maps and other devices to bring statistics to life for non-statisticians.

First there is a guide about writing well-motivated stories with numbers (as opposed to abstractions). This guide is written in four languages, English, Spanish, Croatian and Japanese. A very important distinction should be made though. The versions that are not written in English were actually written by Spanish, Croatian and Japanese statisticians, not translators!

Next is a guide, in the same four languages, about how to prepare effective tables, charts, maps, and other forms of visualizations, with many examples. A highlight:

It offers advice on how to avoid bad or misleading visual presentations.

The third guide book in the series will help producers of statistics to get their message across, and communicate effectively with the media.

Attention to detail

All three books in the series are available, for free, as PDF downloads from the website. However, they are ALSO available in print version, free of charge. No self-addressed, stamped envelope is required. Postage is provided, courtesy of the UNECE, the United Nations Economic Commission for Europe.

What more could one ask for?

regionstraumapro:

When I was at Penn 25 years ago, I was fascinated to see that police officers were allowed to transport penetrating trauma patients to the hospital. They had no medical training and no specific equipment. They basically tossed the patient into the back seat, drove as fast as possible to a trauma…

Results of a comparison, in terms of mortality/survivability rates, for patient transport

  1. as fast as possible by police, when first to arrive at scene of incident, to the nearest emergency room, versus 
  2. ambulance, by trained emergency medical service providers

Results are surprising, but ultimately, logically supported!

PLEASE NOTE

This was specifically for cases of penetrating injuries, usually gunshot wounds, without possible spinal cord complications from head or neck injury.

Sound like a special case, tiny subset, findings not broad enough in scope to be interesting? Yes, that crossed my cynical mind too. I was wrong. The comparison was done over a period of several years, with no fewer than 2,100 observations, in the same small geographic area in Philadelphia, Pennsylvania.

Interview with Amy Heineike - Mathematician

simplystatistics:


Amy Heineike is the Director of Mathematics at Quid, a startup that seeks to understand technology development and dissemination through data analysis….

I recall reading about the oddly-named Quid via TechCrunch some months ago, and later about Amy Heineike’s achievement in snagging a much desired job among the data, mathematics and statistics crowd! Both Amy and Quid seem to be doing well!

Background

I dug around and unearthed that TechCrunch post introducing Quid (September 2010), whose primary focus seemed to be critique of the Quid site’s typeface. (There is no reason a quantitative analysis-based company is obliged to have an ugly website!)

Better coverage was available via a more recent New York Times article (October 2011):

Quid tracks job listings, customer wins and funding valuations at first, second, and third funding rounds to give venture investors a better yardstick — and better quantitative tools — in valuing startups…In fact, Quid is an offshoot of YouNoodle.com, who offered an intriguing calculator that it claimed could predict the exit or liquidity valuation of startups.

The following are excerpts from Simply Statistics’ interview of Ms. Heineike:

What skills do you think are most important for statisticians moving into the tech industry?

Technical statistical [expertise] is the foundation. You need to be able to take a dataset and discover and communicate what’s interesting about it for your users… A key part of that is being willing to engage with questions about where the data comes from (how it can be collected, stored, processed and QAed), how the analytics will be run (how will it be tested, distributed and scaled) and how people interact with it (through visualisations, UI features or static presentations?).

Generally speaking, the earlier stage the company that you join, the broader the range of skills you need, and the more scrappy you need to be about getting involved in whatever needs to be done. Later stage teams and big tech companies may have roles that are purer statistics.

There is a real opportunity for people who have good statistical and computational skills [at a graduate degree level] to get into the startup and tech scenes now. Getting involved in an open source project, working with version control in a team, or sharing your code on Github are all good ways to start.

It’s really important to be able to show that you want to build products though. Imagine the clients or users of the company and see if you get excited about building something that they will use.

Go ahead and read the full interview on Simply Statistics tumblr.

Make sure to leave enough time to have a look around in general, as it is a fine tumblr for the mathematically and statistically inclined, well, for those with practical inclinations!

Constants of R

adamlaiacano:

Of all of the mathematical and scientific constants, they decided to go with lower/upper case letters, month names/abbreviations, and pi.


Constants {base} R Documentation

Usage

LETTERS
letters
month.abb
month.name
pi

R has a small number of built-in constants. The following are available:

  • LETTERS: the 26 upper-case letters of the Roman alphabet;
  • letters: the 26 lower-case letters of the Roman alphabet;
  • month.abb: the three-letter abbreviations for the English month names;
  • month.name: the English names for the months of the year;
  • pi: the ratio of the circumference of a circle to its diameter.

I agree, this does seem like an odd choice.

First thought that crossed my mind: Why pi but not e?

(Source: adamlaiacano)

Security of Statistical Databases

A statistical database (SDB) is a database that is used to return statistical information derived from the records to user queries for statistical data analysis.

Sometimes, by correlating enough statistics, confidential data (stored in an SDB) about an individual can be inferred. Examples of confidential information stored in an SDB might be salaries or data concerning the medical history of individuals.

An important problem is to provide security to SDBs against the disclosure of confidential information. An SDB is said to be secure if no protected data can be inferred from the available queries.


SIAM J. Discrete Math. 25, 1778 (2011)

http://link.aip.org/link/doi/10.1137/070689589*

Now THIS is an example of a network effect! Well, it isn’t truly an example of a Dijkstra style network, to be honest. It is mostly a simple example of information aggregation, at scale, leading to “de-obfuscation” of protected personal data.
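A minimal sketch of how aggregate answers alone can give away an individual value, sometimes called a differencing attack. This is a toy illustration, not the method analysed in the SIAM article. The `Employee` records, the names and salaries, and the `total_salary` query interface are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Employee:
    name: str
    salary: int


# Hypothetical confidential records; the attacker never reads this list directly.
_RECORDS: List[Employee] = [
    Employee("Alice", 52_000),
    Employee("Bob", 61_000),
    Employee("Carol", 71_000),
    Employee("Dave", 48_000),
]


def total_salary(include: Callable[[Employee], bool]) -> int:
    """The only query the toy database answers: a sum over a selected group."""
    return sum(e.salary for e in _RECORDS if include(e))


def infer_salary(target: str) -> int:
    """Recover one person's salary from two permitted aggregate queries."""
    everyone = total_salary(lambda e: True)
    everyone_but_target = total_salary(lambda e: e.name != target)
    return everyone - everyone_but_target


inferred = infer_salary("Carol")
print(f"Inferred salary for Carol: {inferred}")
```

Neither query returns an individual record, yet their difference does. Common defences include refusing queries over very small or nearly complete groups and adding noise to answers, and part of what makes the problem hard is that cleverer combinations of queries can sometimes get around simple restrictions.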

It is the sort of thing that most of us should be concerned about, as much or more so than shadow groups pulling the strings toward mysterious nefarious ends. Lack of security of statistical databases is a near and present danger to individual privacy rights.

This is not an example of my beloved Emanuel Derman’s recent (and wry) identification of the trend toward applying the concept of “SUPER SYMMETRY” to everything, particularly financial services-related and Big Bad Banker evil.

Nor is it a misuse of correlation. Another of my unsung heroes, the venerable Scotty Barber** of Reuters Graphics, has sometimes despaired of seeing an appropriate use of correlation that is not associated with the fallacy of

  • correlation => causation

again, as is so often observed in economic and financial services news of late. I was of the same mind, to tell you the truth. Starting to despair of EVER seeing correlation coefficients used in a way that was free of pejorative connotations. My faith is restored!

* SIAM et al. = Society for Industrial and Applied Mathematics, SIAM Journal on Discrete Mathematics

It has the most appealing blue and green header image of any journal with which I am familiar. Have a look! Give it a click (after reading what follows, if you would be so kind).

** Scotty Barber does data graphics; he is NOT a data visualization abuser. Such abuse is running rampant of late. And it IS possible to do data visualizations that are not abusive of the idea. I recommend Manuel Lima in that regard. He does fine modern data visualizations, without abusing the concept. Manuel Lima’s work was the first data visualization I had ever seen.

Alert: Manuel Lima is now one of us, here on tumblr of late, at Visual Complexity.

(Parenthetical aside

  • Why does Reuters employee Felix Salmon have an assistant, but (also a Reuters employee) Scotty Barber does not? Maybe I am wrong, and Scotty Barber’s assistant(s) are just as modest and understated as he is.
  • For that matter, why doesn’t Reuters give Emanuel Derman, Ph.D. (whether an employee or not) an assistant too? And far more editorial support when he creates original images, which have aesthetic merit in my opinion, about pleasure and pain, to accompany his Reuters posts? Which would you rather see:
Visuals of pleasure and pain
OR
Felix Salmon’s daily list of bookmarks?)

Nota bene the use of “modern”. I know about Edward Tufte. I met Edward Tufte before he published his first book. I sat through a luncheon at Swarthmore College in 1983 (or 1984?) where Tufte warned all five of us around the table (senior class math and statistics majors)

be careful, don’t dip my book in the gravy
while we ate, because
I had to take out a third mortgage on my house to get it published
And a wonderful book it turned out to be, I fully realize! So Tufte is not quite “modern” anymore. Well maybe he is modern. Perhaps the recent data visualization trend is “post-modern”? Yes, I know it is Web 2.0, but that is becoming dated as well.

Be all this as it may, this was a nice little article. It also appeared, I am afraid, in the now tabloid-of-the-physical-science press, PhysOrg. But I still read PhysOrg, and The Sun and The New York Post.

*** SIAM is the only academic journal in which my name ever appeared (very long ago) and is associated with the only prize for academic excellence I ever received. Thus my continued loyalty to SIAM, til death do us part.

Okay, enough original content. The arbiters of SEO have surely been appeased. No burnt offerings need be sent to Mountain View… although a nice roasted chicken with paprika couldn’t hurt. I’m contemplating cooking such later on today. Roasted chickens with paprika are not the reason for my Page Rank of 5, by the way. I don’t think…?

Time to CREATE POST!