Comments on “How should we evaluate progress in AI?”

Comments

Machine learning that matters

Interesting post, summarising many of my own perceptions. I especially like your call for the researcher to maintain contact with the messy real world. This reminds me of Kiri Wagstaff’s “Machine Learning that Matters” at ICML 2012.

Machine learning that matters

That’s a nice paper, thanks, I hadn’t seen it!

Here’s a link for the convenience of other readers.

Has this issue gotten any better or worse since 2012, do you think? (I don’t follow ML closely enough to have a sense of that.)

Interpretability, transparency, explanations

I would say that we are still inundated with work that only uses unreliable proxies of real-world usefulness. It is very easy to build a model, find some data, compute some numbers, and run a variety of statistical tests to claim success. It is far harder to explain what a model has actually learned to do, and demonstrate whether it is even useful to anyone. My research in music information retrieval (MIR) echoes this and what you have said: B. L. Sturm, “Revisiting priorities: Improving MIR evaluation practices,” in Proc. ISMIR, 2016.

However, there are efforts advocating for interpretability, transparency, and explanations in ML: see the Interpretable ML Symposium; Fairness, Accountability, and Transparency in Machine Learning; and HORSE. And there is pressure from governments to ensure the fair application of such algorithms in society, e.g., the recent UK Parliament Science and Technology Committee’s call for written evidence on “Algorithms in decision making”.

Other papers of interest:
D. J. Hand, “Classifier technology and the illusion of progress,” Statistical Science, vol. 21, no. 1, pp. 1–15, 2006.
C. Drummond and N. Japkowicz, “Warning: Statistical benchmarking is addictive. Kicking the habit in machine learning,” J. Experimental Theoretical Artificial Intell., vol. 22, pp. 67–80, 2010.
B. L. Sturm, “A simple method to determine if a music information retrieval system is a “horse”,” IEEE Trans. Multimedia, vol. 16, no. 6, pp. 1636–1644, 2014.

This Is The Best Portrait Of "The AI Mangle" I've Ever Read

JenniferRM's picture

This is a fantastic portrait of real work within the field of study historically called “Artificial Intelligence”, and I feel that it does a wonderful job portraying both the positive and negative aspects of work in this field.

You have an insider’s grasp of the stories and details, and yet time has given you an outsider’s clarity of detached perspective :-)

I think it is extremely useful to notice that of the six dimensions of practice in AI that you point to (Science, Engineering, Mathematics, Philosophy, Design, and Spectacle) none of these dimensions are really essentially about “AI”.

I think maybe what you’re doing in a deep sense is talking about how modern science works in real life in nearly every field of study with aspirations of technical correctness, but you’re maybe doing this through the lens of a field you know very well.

Based on my experience in several different fields, what you’re saying is quite similar to how basically ALL of them work in real life. It is very common for people to start out wanting to do research that will “cure cystic fibrosis” and end up obsessing over yeast genetics for years. With just a little bit of squinting, this dynamic appears quite similar to the toy models produced as groundbreaking GOFAI projects, like Shrdlu, Copycat, Sonya, and Eurisko.

To aim for more generality: I have a long running hobby interest in producing a “cliology of science”.

Cliology in general would function as a quantitative, predictive, and manipulative theory of history. It would compressively explain and largely “post-dict” the past, and it would “pre-dict” the future in a probabilistic way. It would help people build forecasts, and it would give theoretical reasons to empirically measure those determinants of the future that are hardest to explain compressively, and so most need to be simply measured and taken as historical givens. Naturally this project is probably impossible in general, but as a research prompt it offers a glorious vision that leads one into many “rabbit holes” of learning :-)

A cliology of science would merely attempt the restricted and hopefully easier task of developing a theory like this which works specifically on “history of science”. Naturally this project is probably impossible in general, but it is likely to be literally the hardest part of general cliology, and so figuring out why it is impossible might shed light on the nature of history itself, potentially accomplishing something that would seem, as you say, “snazzy” ;-)

Anyway, this is background, raised here to explain why, if you haven’t heard of him, Bruno Latour might be relevant to your interests. His book “Science In Action” is on my list. When I search now I see another probably relevant book, with a full online PDF: Pickering’s “The Mangle of Practice”.

I’ve not read much of Latour directly (because life is finite), but I have read summaries and have an “imaginary Latour” in my head whose central idea seems to be the thing you have just described about AI work over the last 50 years!

Basically there is this thing Latour calls “the mangle” that happens in the workflow of serious knowledge workers. They have tools and techniques, and they also have theory. Their job is to use the tools and techniques to show a thing their theory says should be possible that no one has yet done before. Every small empirical failure in the lab has two possible interpretations: (1) the investigator is incompetent and applying their tools and techniques with insufficient diligence or funding or (2) the theory is wrong.

Showing the thing is taken as incremental confirmation of both the theory (especially the theory’s scope) and also confirmation of the investigator’s adequacy as an investigator who has mastered the tools and techniques of their field.

From the perspective of the lay public, investigatory adequacy is so fundamentally assumed that it almost never arises in discourse in any thoughtful way. To raise this issue well is basically a kind of heresy, attacking the integrity of a Real Scientist and implicitly potentially attacking the ecclesiastic authority of Scientists In General.

From the perspective of a bureaucrat making fast decisions in a funding agency the expectation is that current competence correlates with past success at either an institutional or personal level. A great proposal might screen off prestige indicators, but serious prestige failures for a great proposal would probably cause the proposal to be read as a sophisticated lie. Then they basically fire and forget. A grant that produces no positive result weeds out the investigator as a reasonable winner of future grants. (Implications for the incentives to game this system are left as an exercise for the reader.)

However, from the perspective of the individual scientist, the question posed by the mangle is a serious and urgent and live thing.

Early in one’s career it arises very regularly, and often in the negative, as one discovers more techniques that genuinely haven’t been mastered yet, and are needed to accomplish a project. The western blot keeps not coming out and maybe you ask someone for help who knows the trick or maybe you just keep debugging it yourself… You have trouble figuring out just the right u-substitution to attempt to actually find the integral of a complicated equation…

The beginning of a general mathematical model here would be the “hope function”, which has been wonderfully written up by Gwern. There is a chest with N drawers (techniques), and an object (theoretical confirmation) is either in one of them or in none of them. As drawers are opened and found empty, the probability goes up both for each of the remaining drawers and for the possibility that the object is elsewhere. Extending this to science, and specifically to the mangle, the drawers are investigatory techniques, and the investigator’s skill is a parameter that probabilistically determines the chance of a drawer seeming empty but actually being full.

When all drawers have been exhausted and no theoretically predicted observation has been seen, it suggests going back and double checking drawers a second or third or fourth time.
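
A minimal sketch of that drawer model, in Python, may make the arithmetic concrete. The function name `search_update` and the parameter `miss_prob` are invented here for illustration; `miss_prob` stands in for investigator skill, as the chance that a search of a drawer comes up empty even though the object is in it. The update is plain Bayes' rule and makes no claim to match any particular write-up of the hope function.

```python
def search_update(drawers, elsewhere, searched, miss_prob):
    """Update beliefs after searching one drawer and finding nothing.

    drawers   -- prior probability that the object is in each drawer
    elsewhere -- prior probability that it is in none of them
    searched  -- index of the drawer that was just searched
    miss_prob -- hypothetical skill parameter: chance of missing the
                 object when it really is in the searched drawer
                 (0.0 means a perfectly reliable search)
    """
    # Likelihood of "found nothing" under each hypothesis.
    likelihood = [miss_prob if i == searched else 1.0
                  for i in range(len(drawers))]
    evidence = sum(p * l for p, l in zip(drawers, likelihood)) + elsewhere
    new_drawers = [p * l / evidence for p, l in zip(drawers, likelihood)]
    return new_drawers, elsewhere / evidence


# Four drawers, 80% confident the object is in the chest at all.
prior = [0.2, 0.2, 0.2, 0.2]
perfect, perfect_elsewhere = search_update(prior, 0.2, searched=0, miss_prob=0.0)
sloppy, sloppy_elsewhere = search_update(prior, 0.2, searched=0, miss_prob=0.5)
```

With a perfect search, the emptied drawer drops to zero and both the remaining drawers and "elsewhere" rise. With an imperfect search, the searched drawer keeps some probability, which is the formal reason that going back to double check a drawer can be worthwhile.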

In real life, over a real career, investigatory skills presumably go up over time to some peak (and then begin to decay when the researcher leaves the lab). However, with every subsequent new “piece of science” that is genuinely new, requiring a new application of a previous skill with additional new demands, the question arises again. Nobel winners rarely win a second prize, because luck is a major component here. Marie Curie got two, but she used essentially the same technique for both: boiling a huge amount of slightly radioactive ore down to its maximally radioactive essence. There turned out to be more than one such essence, hence more than one Nobel was awarded! She was quite a lucky scientist ;-)

I would propose that “a rationality” is similar to a “technique” or a “drawer” here: a tool of clear shape and purpose with limits that make it inherently non-universal in scope.

Then “a meta-rationality” would be similar to a field of study where people gain specific skills, and also a general meta-skill at navigating “the mangle” relative to a shared overarching vision of what the field is even supposed to be aiming for.

The implication is that to understand reality in general it may be necessary to have a tragically duplicative “meta-meta-rationality”.

The cultural diversity of fields is clearly visible, and I think often attributed to reality itself having structure which can be broken into parts and analyzed separately. However, I am personally suspicious of this. It seems likely to me that reality is “one big thing” and that the lines between different fields (that expect mastery of different techniques to demonstrate answers to different questions) are often at least somewhat socially contingent responses to the tragically short lifespans of humans and the complex political realities of scientific funding and scientific self promotion.

I would suggest that perhaps there is another layer here, a piece of which this essay on “AI” strongly points to: dimensions of variation between fields. Knowing these might help someone interested in reality itself figure out which fields they should be aware of, and borrow from, in order to achieve the various goals a knowledge worker might have.

Here I come back to your breakdown, which I find fascinating, and want to extend a bit into a list of possible field-level virtues:

  1. Truth as pursued by Science in general.
  2. Technical Utility as efficiently deployed by Engineers from various fields.
  3. Abstract Rigor as embodied most in the field of Mathematics.
  4. Meaning as verbally clarified and discovered through exploratory Philosophy.
  5. “Snazziness” as creatively embodied in a process of Design.
  6. Influence which is classically achieved via Spectacle, with the aim of recruiting resources into a grand endeavor.

Now my personal approach to science is descriptivist and opportunistic. My cliology of science hobby lets me look at science itself like a bug under a microscope without caring much about it. Also, I like fields that offer me something useful, and I mostly just want to know what they are really like so that I can find useful things faster when a field is relevant to my interests.

However, I don’t personally find it meaningful to exhort scientific fields to be different than they actually are.

I might dis a children’s book for being long and boring, or dis it for making children cry. I won’t dis a children’s book for being childish however. That’s essential to the idea of a children’s book! Similarly I’m not going to complain about romance novels being full of unrealistic relationships. Similarly I’m not going to complain about Science Fiction showing an astonishing amount of political agency put in the hands of scientifically literate technicians.

I feel like maybe academic or scholarly fields (to the degree that they don’t get government funding) are like books from different genres. The scholars in them want what they want, and have made their own choices about the pursuit of truth, utility, rigor, meaning, and snazziness. They only have so many “points” to spend on the formation of scholars in that field and tradeoffs are inevitable…

Even spectacle (or lack thereof) I can forgive. Basically the only thing I see as a sound basis for complaining about a field of study would be that it might be funded by taxes collected by governments on pain of punishment, and then spent in a way that fails to benefit (or even hurts) the public from whom the taxes were taken. This isn’t even a thing that a field deals with. Fields span countries, living in the space of the mind.

You could talk about the Canadian Ecology Establishment or the British Physics Community and how each may or may not be funded at the appropriate levels, and in the public mind I’d expect this to be related to their production of ecological spectacles visible to Canadian taxpayers or physics spectacles visible to British taxpayers… but this is going to be even MORE “inside baseball” than the question of how researchers should personally deal with the mangle in their field ;-)

In the meantime, as an aspiring meta-meta-rationalist I think it is helpful, when looking at a field, to separate the tradeoffs that I might abstractly wish it had made (so that it would be useful for me as a sort of idea thief) from the tradeoffs it actually made (for its own internal productive reasons, which may have causes that can be scientifically analyzed).

You taking a prescriptive interest here sort of makes me wonder… Are you thinking of going back into the AI laboratory with a new theoretical perspective and new stance towards “the mangle”? That would be pretty awesome :-)

On tools vs. agents, and machine learning vs. statistics

Your comments about dishwashers and demos remind me of the argument I made about tools and agents in my two posts here. IMO, one of the biggest problems with current AI hype is the focus – not in the research literature, but in the popular conversation – on fully autonomous systems that can act without human supervision. This doesn’t even make sense as an overly optimistic take on the direction of the research, since most deep learning (even the really impressive stuff) is focused on mapping inputs to outputs on circumscribed tasks without a full perception-decision-action-result loop, and deep reinforcement learning has not yet had practical successes outside of game playing (and not for lack of trying). Maybe people are responding to AlphaGo and AlphaGo Zero, attributing their successes to generic “deep learning” and ignoring the decision-making element. I don’t know.

The other thing I wanted to say was about this:

“Data science” is, in part, the application of AI (machine learning) methods to messy practical problems. Sometimes that works. I don’t know data science folks well, but my impression is that they find the inexplicability and unreliability of AI methods frustrating. Their perspective is more like that of engineers. And, I hear that they mostly find that well-characterized statistical methods work better in practice than machine learning.

I work as a data scientist, and much of this is a pretty good characterization of what I do and think. But I don’t agree with the conflation here of “AI” and “machine learning,” nor with the assertion about “well-characterized statistical methods.” In practice, a “data scientist” is usually something like an applied statistician who works entirely within the “algorithmic modeling” culture described by Breiman in his “two cultures” paper (which has probably influenced my thinking more than any other single academic paper, BTW). That is, they are concerned with making predictions (or making other productive uses of data), not with interpretation per se; while we care about interpretability a lot, it’s mainly as a means to other ends. We don’t care about what a given coefficient is (estimation), only what it does. This leads to a “whatever works” attitude about model classes, and we will use complicated/fancy model classes with no estimation or hypothesis testing frameworks as long as they perform well in a way we can statistically validate.

I find terms like “AI” and “machine learning” unhelpful here, as it’s not clear to me where the boundaries are supposed to lie, and it’s in the very nature of my work to use everything from linear regression to fancy deep neural nets depending on my use case. I do find that some of the cutting-edge deep learning stuff is not very useful in practice, but some of it is, and the useful / not-useful lines I tend to draw are not easy to match up to the boundaries of concepts like “machine learning.” And while deep learning methods are very data hungry (as you say, they are much like fuzzy look-up tables of massive size) and costly to create for new tasks, they’ve grown to be mature and reliable engineering components for some very standard, broadly applicable ones like parsing.

Horses in machine learning

Bob, thank you—that’s a really interesting and useful set of citations! The Clever Hans analogy is a compelling one. In addition to your own publications, I found this one from David Hand particularly clear.

Whatever works

nostalgebraist, thank you for the comment! I’m a lurking reader of your tumblr, by the way, and generally find myself in violent agreement and/or admiration.

“Let the humans do what they are good at, and the machine do what it’s good at” seems like a sensible principle. The first AI project I worked on was the Programmer’s Apprentice, which took that approach (but, sadly, didn’t get very far).

In a perhaps-odd coincidence, Breiman’s paper was recommended to me by someone else yesterday, and I did a quick skim then. I haven’t read it properly… but I guess I’d say that, so far, I find it unsatisfactory. I would like statistics to be both interpretable and predictive. This is easy for me to say as a non-statistician! But given a forced choice between “interpretable but not very predictive” and “predictive but not interpretable,” my opinion is “neither, thanks! can’t you guys do better than that?” This is highly unrealistic, obviously, because in many cases the choice genuinely is forced, and then reluctantly I guess I agree that predictivity usually should win, but it doesn’t make me happy.

A prescriptive interest

JenniferRM, thank you for the exceptionally thoughtful comment! Generally I agree with it strongly. And, the way you are talking about “rationality” appears identical to the way I think about it (which is not universally the case).

Latour and Pickering were already on my must-read list, and I’m embarrassed that I haven’t read them yet. The Hope Function piece is new to me, and now on the list!

You are right that my taking a prescriptive attitude is evidence of my still considering myself an AI insider, to at least some extent. (This surprises me.) Since January 2014, when I first read about the spectacular ImageNet results, I’ve been half-wanting to jump back into active participation.

But, I think it’s possible to be prescriptivist about fields you are not part of, when you notice they aren’t living up to their own rhetoric. If the Canadian Ecology Establishment spent almost all its time arguing about what shade of beige to paint its conference rooms, then I could legitimately criticize them for that.

On the other hand, if they are emphasizing descriptive field case studies over statistics, or vice versa, my opinion about that is probably not worth anything.

Clarifying the two cultures

In a perhaps-odd coincidence, Breiman’s paper was recommended to me by someone else yesterday, and I did a quick skim then. I haven’t read it properly… but I guess I’d say that, so far, I find it unsatisfactory. I would like statistics to be both interpretable and predictive.

Hmmm… reading over my comment again, I think I didn’t really do justice to Breiman’s view (or my own) when I said algorithmic modelers like me don’t care about interpretation for its own sake. The divide I’m trying to get at is more subtle: it’s about whether you assume at the outset that the real phenomenon follows patterns you know how to interpret.

A lot of the statistical work that gets done in the world follows the procedure “perform linear (or logistic) regression no matter what the data set looks like, do hypothesis tests on the coefficients, draw conclusions about the phenomenon from these tests.” This is a bit of a caricature, but honestly most of the social and medical science I read does exactly that. There is sort of a “convenient-world assumption” here where you take a certain kind of easily interpretable pattern (linear relationships), assume all patterns in the data are of that form, and run a pattern-detecting procedure that depends on that assumption. If the assumption is not valid, this can result in missing real patterns but also seeing unreal ones.

What Breiman and I prefer is to try a bunch of models, aiming at predictive performance, and then do interpretation afterwards. When I do start to interpret, I’m not locked in to a single framework and betting everything on its (approximate) truth. Instead I have a bunch of frameworks, and a sense of how much it “costs” in predictive terms to make the assumptions inherent in each one. Sometimes the only model that predicts well is a fancy one I can’t interpret – but that has interpretive value in itself, telling me that my data is really that messy, and that I would be fooling myself if I made a convenient-world assumption. I’m letting reality tell me how easy it is for me to understand, rather than assuming it is easy for me to understand.
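
To make the “cost in predictive terms” idea concrete, here is a small sketch in Python. Everything in it is an assumption made for illustration: the synthetic data (a noisy parabola), the two model classes (a straight line and a k-nearest-neighbours average), and all the function names. It is not meant as anyone's actual workflow, only as the shape of the comparison.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature: the convenient-world model."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=5):
    """k-nearest-neighbours average: assumes little about the pattern's shape."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def mse(model, xs, ys):
    """Mean squared error of a fitted model on held-out points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def held_out_errors(truth, seed=0, n=300, n_train=200):
    """Fit both model classes on a training split, score both on the rest."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    ys = [truth(x) + rng.gauss(0.0, 0.3) for x in xs]
    train_x, test_x = xs[:n_train], xs[n_train:]
    train_y, test_y = ys[:n_train], ys[n_train:]
    return {
        "linear": mse(fit_line(train_x, train_y), test_x, test_y),
        "knn": mse(fit_knn(train_x, train_y), test_x, test_y),
    }


# Synthetic "reality" that is not linear (an assumption of this toy).
errors = held_out_errors(lambda x: x * x)
cost_of_linearity = errors["linear"] - errors["knn"]
```

On this curved data the straight line does much worse on the held-out points, and `cost_of_linearity` is one crude way to put a number on what the linearity assumption costs. Passing a genuinely linear `truth` instead should shrink that gap towards zero, which is the case where the convenient-world assumption is cheap.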

Three statistical cultures?

Thanks, this is a helpful clarification!

I seem to be developing strong opinions about statistics, and intend to write about them in The Eggplant. This is absurd, because I don’t actually know anything about the subject. However, it’s a free country, and if I choose to make a fool of myself in public, no one can stop me.

Here are three possible ways of proceeding (all perhaps oversimplified to the point of caricature):

The first culture is one-size-fits-all. It prioritizes ease of use. You have your standard method/model/paradigm, you get some data, you run the data through your thing, it says p<0.05, and you are done. (This is, as you say, what the typical working scientist does.)

The second culture has many methods/models/paradigms, and no a priori commitment as to which to use. It prioritizes predictivity. You run your data through several, find what fits the data best, and you are done. (This seems to be what a lot of ML and data science people do. If my skim of Breiman is correct, this is also what he advocates.)

The third culture also has many methods/models/paradigms, but prioritizes explanation. It wants to discover the data-generating process in the real world (not the shape of the data you’ve got). You are willing to sacrifice some predictive accuracy if there’s reason to think that a less predictive model is more mechanistically accurate. (This seems to be what, e.g., Andrew Gelman advocates, if I’m reading him correctly.)

As a naive outsider, I favor culture 3 (if it’s even a thing). Clearly, it’s not always successful (but then, neither are the other two). If you can’t get an explanatory model, a predictive one is still good!

My current reading in stats aims to gather an understanding of how culture 3 works—how you go about guessing at, and then testing, a theory of the real-world process.

Does my three-cultures story seem to have anything to do with reality? (And, if so, can you recommend things to read that might clarify or support the story?)

Latour

Bit off topic but relating to Latour… I read his ‘Why Has Critique Run Out Of Steam?’ recently which I think you would like. He covers some similar territory to you on the dangers of half-digested pomo ideas, and talks about how this kind of critical theory pushes the rich world of real objects into two impoverished categories:

We can summarize, I estimate, 90 percent of the contemporary critical scene by the following series of diagrams that fixate the object at only two positions, what I have called the fact position and the fairy position…

The fairy position is very well known and is used over and over again by many social scientists who associate criticism with antifetishism. The role of the critic is then to show that what the naïve believers are doing with objects is simply a projection of their wishes onto a material entity that does nothing at all by itself…

But, wait, a second salvo is in the offing, and this time it comes from the fact pole. This time it is the poor bloke, again taken aback, whose behavior is now “explained” by the powerful effects of indisputable matters of fact: “You, ordinary fetishists, believe you are free but, in reality, you are acted on by forces you are not conscious of. Look at them, look, you blind idiot” (and here you insert whichever pet facts the social scientists fancy to work with, taking them from economic infrastructure, fields of discourse, social domination, race, class, and gender, maybe throwing in some neurobiology, evolutionary psychology, whatever, provided they act as indisputable facts whose origin, fabrication, mode of development are left unexamined)

There’s a lot more going on in the essay beyond this… not sure I can really summarise it.

Running out of steam

drossbucket, thank you! By odd coincidence, I read that about two weeks ago too. It had been open in a tab for a couple of years. (I think Mike Travers recommended it to me originally.)

I found it good and important, and thought I should write a short summary of it, or commentary on it, or something; but (as JenniferRM notes) life is finite…

One manifestation of which is that I have 1078 tabs open with things I “really ought to read”… [← actual number as of now]

David – yes, you are right to

David – yes, you are right to distinguish three cultures/approaches. When it comes to doing science, we do need something the second culture lacks: persistent mechanistic hypotheses that get tested, discussed and approved over time. This is also lacking in the practice of first-culture science, as I lamented here (but see also my exchange with discoursedrome in the notes).

But even here, the second culture has some lessons that ought to be heeded. You write

It wants to discover the data-generating process in the real world (not the shape of the data you’ve got). You are willing to sacrifice some predictive accuracy if there’s reason to think that a less predictive model is more mechanistically accurate.

There is a danger here. One reason that second-culture stuff has been so popular recently is the raw predictive success of the complicated, less interpretable models it embraces. I read that success as evidence for something that always seemed plausible to begin with: outside of physics and chemistry, reality’s “data-generating processes” are vastly more complex than the sorts of things humans tend to come up with off the tops of their heads, when playing the guess-and-check-and-guess-again hypotheco-deductivism game. This is a clue about reality, not a sacrifice of reality in favor of mere “prediction.”

If we’re following this line of thought, we ought to look closer at what these “second-culture models” do. If we do, we will see a striking feature: they generally do not look like models of any data-generating process we would expect in reality. The classic workhorses of data science are models that average together lots of decision trees, like random forests (Breiman’s invention) or their even more formidable sibling, boosted trees. These do fantastically well, but no one thinks reality consists (mechanistically) of decision tree ensembles. The same goes, of course, for neural nets. Calling these things “models” of the data is actually kind of misleading; it might be better to call them something like “perceptual systems.” No one expects to peer inside an organism’s visual system and “read off” the physics of light and the properties of ecologically common materials, but this is no count against vision. (The fact that this distinction is often of no practical importance to the practicing data scientist perhaps explains why statistics and AI have gotten so curiously blended together in that field.)

So, if we care about reality, we may have the following worry. What if, in many subjects of interest, our mechanistic hypotheses are just too simple? What if, in eschewing the “second-culture models,” we are like scientists trying to understand the relation of retinal activations to facts about the world, who go on proposing and testing crude hypotheses about the mean values of the activations or their standard deviations, when it might be better to learn to see, and only then to interpret what happens in the (massively complex) process of sight?

Gelman and his paper with Shalizi

Computer Nerd's picture

Unless I misunderstand you, I don’t think Gelman falls into group three, which seems like it would rule out “wrong but useful” models. He wrote a really, really good paper with Cosma Shalizi, Philosophy and the Practice of Bayesian Statistics, which is worth reading for anyone.

A key quote:

We are not interested in falsifying our model for its own sake – among other things, having built it ourselves, we know all the shortcuts taken in doing so, and can already be morally certain it is false. With enough data, we can certainly detect departures from the model – this is why, for example, statistical folklore says that the chi-squared statistic is ultimately a measure of sample size (cf. Lindsay & Liu, 2009). As writers such as Giere (1988, Chapter 3) explain, the hypothesis linking mathematical models to empirical data is not that the data-generating process is exactly isomorphic to the model, but that the data source resembles the model closely enough, in the respects which matter to us, that reasoning based on the model will be reliable. Such reliability does not require complete fidelity to the model.

The really interesting point that has stuck with me from the paper is that, having designed a model, one should treat actually fitting it as a kind of “principal-agent” problem. Treat the model as your agent drawing conclusions from the data, and yourself as the principal. The model has priors, not you, but you set its priors so the model won’t draw conclusions in “bad faith” (like overfitting). And hopefully, you then check that the conclusions make sense (if you’ve got a generative, Bayesian model, one way is just to draw a bunch of samples and see if they’re distributed like your observations).
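
The check in that last parenthesis can be sketched in a few lines of Python. This is a toy and not anything taken from the Gelman and Shalizi paper: the model (independent coin flips with a uniform Beta(1, 1) prior), the test statistic (how often the sequence switches between 0 and 1), and all the function names are assumptions chosen for illustration.

```python
import random


def count_switches(seq):
    """Number of positions where the sequence changes value."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def switch_check(observed, n_draws=2000, seed=0):
    """Fraction of replicated datasets with as few switches as the data.

    Assumed model: each observation is an independent Bernoulli(theta)
    draw, with a Beta(1, 1) prior on theta. The posterior is then
    Beta(1 + successes, 1 + failures).
    """
    rng = random.Random(seed)
    successes = sum(observed)
    failures = len(observed) - successes
    observed_switches = count_switches(observed)
    as_few = 0
    for _ in range(n_draws):
        # Draw a parameter from the posterior, then a fake dataset from it.
        theta = rng.betavariate(1 + successes, 1 + failures)
        replicated = [1 if rng.random() < theta else 0 for _ in observed]
        if count_switches(replicated) <= observed_switches:
            as_few += 1
    return as_few / n_draws


# Made-up data with obvious structure the model cannot express:
# ten zeros followed by ten ones, so only a single switch.
observed = [0] * 10 + [1] * 10
fraction = switch_check(observed)
```

Here almost none of the model's replicated datasets look as streaky as the observations, so the independence assumption is visibly failing in a respect one might care about. That is the principal checking the agent's work, without needing the model to be exactly true.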

Looks more effective != is more effective

Anonymous Coward's picture

One of the later sections (“Spectacle”) was more relevant than you might have realized.

A story I heard directly from the former head of engineering at a major appliance manufacturer many years ago: the research lab had come up with a new spray design that looked like it would be much more effective in cleaning dishes. The design had been transferred to engineering and the lab was working on measuring just how much more effective the new design was. As designs were completed and moved to manufacturing, the marketing people were brought in to figure out how best to sell the new capability.

Unfortunately, not long before the product was scheduled to be introduced, the tests were finally completed. The new spray was not more effective than the old one – it looked great, but it didn’t clean dishes any better. And by this time, they were starting to make the new dishwashers and it was too late to cancel the advertising space.

They did the only thing that they could do: without claiming that it was more effective, they showed the spray in action and talked about how it was more powerful. They didn’t lie, but they didn’t tell the whole truth – it wasn’t any better at cleaning dishes than the old design.

I wonder how many AI systems are like this….

There was a recent article (https://www.theguardian.com/technology/2018/jul/06/artificial-intelligen…) on the “AI” services that use people behind the scenes to supplement, fix, or provide the entire service.

More engineers and developers building dishwashers with New! Improved! sprays….

Book review of “Naming Nature” by Carol Kaesuk Yoon

anders's picture

I was curious about the tradeoff between accuracy and interpretability in other fields, so I checked out a book on taxonomy.

According to Carol Kaesuk Yoon, categorizing living things (and foods) is a basic human need. However, the flood of new plants and animals brought to Europe by the Age of Exploration posed new problems for classification. To deal with it, Linnaeus came up with a single system for every natural thing (rocks included).
The continuing discovery of more and more species led to the professionalization of taxonomy. You had to have a good deal of experience with a particular family of life to get a good feeling for how its members should be grouped. The flaw with this method was that, because the decisions were based on personal weighing of evidence, it was difficult to resolve disagreements on how things should be classed.

There were three movements to rationalize taxonomy. The first, in the 50s, used computers to classify creatures by the number of features they shared (chapter 8 has a great discussion of the cloudiness this pushed to the edges). The second classified creatures by how similar their amino acid sequences were. And the third said that the only thing that mattered was not general similarity but shared innovations.
Each of these movements was more scientific than the last (and was tremendously criticized by traditionalists as completely lacking judgement) and brought new, unexpected discoveries, but each one also brought taxonomy farther and farther from the everyday intuitive classification of life. This culminated, for Kaesuk Yoon, in the destruction of “fish” as a taxonomic category. The argument goes that nestled within the fish is the clade of tetrapods, a group that includes every amphibian, reptile, and mammal. If “fish” were a clade, we would have to include every tetrapod as a fish; therefore “fish” is not a meaningful taxonomic group.
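
The “fish” argument can be sketched in a few lines of Python. The tree below is my own heavily simplified illustration, with one representative animal per branch, and is not taken from the book. A group counts as a clade here if it is exactly the set of tips descending from some single node:

```python
# Simplified, illustrative vertebrate tree: each node maps to its children.
TREE = {
    "vertebrates": ["lamprey", "jawed vertebrates"],
    "jawed vertebrates": ["shark", "bony vertebrates"],
    "bony vertebrates": ["tuna", "lobe-finned"],
    "lobe-finned": ["lungfish", "tetrapods"],
    "tetrapods": ["frog", "amniotes"],
    "amniotes": ["lizard", "whale"],
}


def leaves(node):
    """Return the set of tips (animals) descending from a node."""
    children = TREE.get(node, [])
    if not children:
        return {node}
    found = set()
    for child in children:
        found |= leaves(child)
    return found


def all_nodes():
    """Every internal node and tip mentioned in the tree."""
    nodes = set(TREE)
    for children in TREE.values():
        nodes.update(children)
    return nodes


def is_clade(group):
    """True if the group is exactly the set of tips under some node."""
    group = set(group)
    return any(leaves(node) == group for node in all_nodes())


fish = {"lamprey", "shark", "tuna", "lungfish"}
tetrapods = {"frog", "lizard", "whale"}

print(is_clade(fish))              # the everyday category
print(is_clade(tetrapods))         # a genuine clade in this tree
print(is_clade(fish | tetrapods))  # fish plus everything nested inside them
```

In this toy tree the everyday “fish” fail the test, while “fish plus tetrapods” pass it, which is the sense in which a whale ends up inside any clade that contains all the fish.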

Kaesuk Yoon said that when she started writing the book she was confident that science was the only way to order the world, but as she wrote she realized that her own two eyes tell her that fish obviously exist. Because of this, she realized that there is more than one true way of organizing the world and that she, like everyone else, was continuously moving between them. The upshot of all of this is that since “fish” is not a taxonomic category but a naive classification, whales are absolutely 100% fish.

Vastly more complex than the hypothetico-deductive game

nostalgebraist, thanks for this comment, and for the link to your discussion with discoursedrome. I’ve been chewing on them for a couple days, and am not sure I’ve finished yet, but:

In data science, I gather, you are often handed a big file full of numbers, with little or no explanation for where they came from, and you look for patterns in them. And sometimes you can find something important, using “second culture” methods, and that’s great.

And sometimes this is all that’s possible, because the reality that generated the data simply isn’t understood. Your point that the phenomena modeled are often vastly more complex than what could be captured by any standard form of statistical model seems importantly right.

But, if one can know something about the underlying reality, that and the statistical methodology ought to inform each other. I’d like it to be the case that the hypothetical “third culture” is about that.

To take a highly simplistic analogy, if you are trying to generate accurate tide tables, you don’t want to fit a polynomial to your historical time series data. You don’t want to fit even a Fourier series. You want to start from a Newtonian model of the sun, moon, and earth system, and fit its parameters. We know the correct form of the model from other types of evidence.
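
A minimal sketch of the tide analogy in Python, with loud caveats. This is not a full Newtonian model: it takes the periods of the two main tidal constituents (M2, lunar, about 12.42 hours; S2, solar, 12 hours) as given by astronomy, and fits only the mean level plus an amplitude and phase for each by least squares. The “observations” are synthetic, and the names `fit_tide` and `predict` are hypothetical.

```python
import math

M2_PERIOD_HOURS = 12.4206012  # principal lunar semidiurnal constituent
S2_PERIOD_HOURS = 12.0        # principal solar semidiurnal constituent
PERIODS = (M2_PERIOD_HOURS, S2_PERIOD_HOURS)


def design_row(t, periods=PERIODS):
    """Basis functions at time t: a constant, then cos and sin per period."""
    row = [1.0]
    for period in periods:
        w = 2.0 * math.pi / period
        row += [math.cos(w * t), math.sin(w * t)]
    return row


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_tide(times, heights, periods=PERIODS):
    """Least-squares coefficients via the normal equations (A^T A) x = A^T y."""
    rows = [design_row(t, periods) for t in times]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * h for r, h in zip(rows, heights)) for i in range(k)]
    return solve(ata, aty)


def predict(coeffs, t, periods=PERIODS):
    """Predicted height at time t for the fitted coefficients."""
    return sum(c * x for c, x in zip(coeffs, design_row(t, periods)))


# Synthetic example: thirty days of hourly readings from made-up coefficients.
true_coeffs = [2.0, 1.2, -0.5, 0.3, 0.4]
times = [float(hour) for hour in range(720)]
heights = [predict(true_coeffs, t) for t in times]
fitted = fit_tide(times, heights)
print(fitted)
```

Because the form of the model is fixed in advance, the data only has to pin down five numbers, and the fit keeps making sense for times well beyond the observations, which a polynomial fitted to the same series would not.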

Again, this is often not possible, but I take “third culture” methods as attempts to at least partially constrain the type of model used through understanding of the domain derived from sources other than analysis of the data itself.

What if, in many subjects of interest, our mechanistic hypotheses are just too simple?

Plausibly this is what has happened in both (e.g.) social psychology and nutrition. Plausibly there simply is no way to make any genuine progress in either of those fields, for now. All the existing knowledge has turned out to be false, and it may be that no knowledge can be derived from methods currently conceivable. (This brilliant but sad podcast is about the current state of social psychology; I’ve ranted about nutrition here.)

Your perception analogy is very interesting. I don’t think I’ve finished thinking it through yet!

Gelman & Shalizi

Computer Nerd, thanks for the pointer. It happens that I read that paper a few weeks ago. I found it highly inspiring, too!

I took it as “third culture” in that it advocates getting the form of the model to reflect your understanding of the underlying reality as much as is feasible. Within the passage you quoted:

the data source resembles the model closely enough, in the respects which matter to us, that reasoning based on the model will be reliable.

A couple of other quotes along the same lines:

the better our probability model encodes our scientific or substantive assumptions, the more we learn from specific falsification.

The appropriate design depends on many contingent material facts about the system we are studying

It would be ideal if the model exactly matched reality, but this is rarely feasible (outside the hardest of hard science). So what one aims for is “true enough for current purposes” (an attitude that I will describe as characteristically meta-rational). That is: “the data source resembles the model closely enough.”

And so the work of “third culture statistics” is in finding ways to do that. This involves a “reflective conversation with the materials”—a design cycle (as in my OP) of building a trial model, evaluating it (visually, in statistics as well as architecture!), scrapping it, and using the intuition derived from that attempt in creating another. Their example of analyzing redistricting data fits this pattern, e.g.

You are reading a metablog post, dated June 30, 2018.

Copyright ©2010–2018 David Chapman.