Stanford-Binet IV, of course! Time marches on!
Robinson, N.
Roeper Review
Vol. 15, No. 1, pp. 32-34
September 1992

This article by Nancy Robinson makes the case that the Stanford-Binet IV is the best current test for intelligence assessment, though she admits that there is still work to be done to improve it. (NOTE: This article pre-dates the publication of the Stanford-Binet V, which replaced the SB-IV.)

    Nancy M. Robinson is Director, Halbert Robinson Center for the Study of Capable Youth, University of Washington, and a member of the Editorial Advisory Board of the Roeper Review.

Let me state, from the outset, my emotional commitment to the Stanford-Binet. I am a Stanford product (B.A. 1951, Ph.D. 1958), as was my husband; indeed, we met, as undergraduates, in Maud Merrill's class. That beloved, shy, reserved, dignified, and caring lady was our mentor for the rest of her days. Her testing course constituted a make-or-break criterion for our continued graduate study, and the 1937 Stanford-Binet became a part of us and of our deepest understanding about what intelligence, giftedness, and mental retardation "really" are. Through the vehicle of the test, we learned to deal with little children as well as big ones, and developed a gut-level sensitivity to various mental ages and levels of ability. This was our bedrock.

And yet, time marched on, even in those years. By 1960, Maud ruefully acknowledged that Wechsler's deviation IQs would prove superior to ratio IQs for test construction, score stability, and even meaningfulness, and her 1960 Stanford-Binet reflected this understanding. Its standardization was imperfect; it was based primarily on other investigators' existing databases of Forms L and M; and the ethnic makeup of the populations was not identified. As time went on and the age of computers arrived, more sophisticated and competing factor-analytic approaches became practical, and the monotheistic belief in g-factor intelligence became less compelling, though never forsaken. Yet, there was much to recommend the old test: its developmental format, its diversity of items, its age structure, and its appeal to individuals of mental levels from age two years to Superior Adult. You "knew where you were" when you used that test! What a relief that Robert L. Thorndike consented to update the 1960 norms for the 1972 Revision! But it has been 20 years since then; the old ways are ready to go.

There are two principal reasons to adopt the new form of the Stanford-Binet. The first is the obvious superiority of more recent norms. Whatever the causes, all over the world people of all ages are getting better at taking tests (Flynn, 1987). Every couple of decades, the norms have to be adjusted by about one-half of a standard deviation.

At the higher ends of the scale, differences tend to be greater than at the mean (see, for example, WISC-III vs. WISC-R [Wechsler, 1991]), and social changes sometimes produce age variations in test-taking skills (and, perhaps, underlying cognitive skills), as happened dramatically between 1960 and 1972 for preschool children in this country. Sticking with the old norms will certainly raise our estimates of the prevalence of very high IQs but will provide less and less accurate information.
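As a rough illustration of the norm drift just described, the following Python sketch estimates how many points an outdated set of norms might add to a score. It rests on assumptions that are not in the article: the drift is treated as linear at one-half of a standard deviation per 20 years, the standard deviation is taken as 16 points (consistent with the article's later equating of four standard deviations with IQ 164), and the function names are hypothetical. Because the article notes that differences tend to be greater at the higher ends of the scale, a linear estimate like this one is probably conservative for very bright children.

```python
# Rough sketch only: linear norm drift is an assumption, not the author's method.
SD = 16.0                      # assumed Stanford-Binet standard deviation
DRIFT_SD_PER_DECADE = 0.5 / 2  # "one-half SD every couple of decades", read as linear


def estimated_inflation(years_since_norming: float) -> float:
    """Approximate IQ points by which outdated norms overstate a score."""
    return DRIFT_SD_PER_DECADE * SD * (years_since_norming / 10.0)


def adjusted_score(obtained_iq: float, years_since_norming: float) -> float:
    """Obtained score minus the estimated inflation from aging norms."""
    return obtained_iq - estimated_inflation(years_since_norming)


if __name__ == "__main__":
    for years in (0, 10, 20):
        print(years, "years:", estimated_inflation(years), "points of inflation")
```

Under these assumptions, norms that are 20 years old would overstate a score by about 8 points, so an obtained 140 would correspond to roughly 132 against current norms.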

Adding to the risk of error in using the test with very bright children, Thorndike (Terman & Merrill, 1972) never provided the statistical information necessary to extrapolate scores on the 1972 version as had Pinneau in the 1960 version, although Thorndike's curious decision to reprint the 1960 version in its entirety along with his 1972 norms led many testers seriously astray. Not since 1960, then, has there been a way to use the tables to interpret scores above IQ 165, even for children younger than age 12 years, 8 months, who would top out at that IQ with a perfect performance (MA = 22 years, 10 months). Stanford-Binet IV tables also stop at 165, but at least one can compare a given child's performance with that of other children today, not children of 1972 or, worse, 1960. Furthermore, with children who score above norms, or are able to perform on subtests not usually administered to their age group, one can use the table provided on page 157 in the supplementary manual to estimate age-level equivalents of Stanford-Binet IV subtests.

We have, of course, since abandoning the error-ridden ratio IQ, always been on shaky ground in calculating scores vastly deviant from the mean; there are so few individuals with competence more than four standard deviations above the mean (IQ 164+) that the true shape of the curve at those levels is unknown. Perhaps there are more such children "out there" than would be predicted - I hope so - but estimates are dangerously dependent on errors in population sampling and test construction, often errors which escalate as one goes farther from the mean.
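To make the scarcity at four standard deviations concrete, here is a minimal Python sketch of what a strictly normal curve would predict. The mean of 100 and standard deviation of 16 are assumptions inferred from the article's "IQ 164+" figure, the normal model itself is exactly what the author says cannot be verified at these levels, and the function names are hypothetical.

```python
import math

MEAN = 100.0  # assumed scale mean
SD = 16.0     # assumed standard deviation (100 + 4 * 16 = 164)


def iq_at(z: float) -> float:
    """Deviation IQ that lies z standard deviations from the mean."""
    return MEAN + z * SD


def upper_tail(z: float) -> float:
    """Proportion of a normal population scoring above z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def expected_count(z: float, population: int) -> float:
    """Expected number of people above z in a population of the given size."""
    return upper_tail(z) * population


if __name__ == "__main__":
    print("IQ at +4 SD:", iq_at(4))
    print("Expected per 100,000 children:", expected_count(4, 100_000))
```

On these assumptions, only about 3 children in 100,000 would be expected to score above IQ 164, which is why even a large standardization sample contains too few such cases to establish the true shape of the curve.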

The second significant advantage of the new scale is its factorial structure, imperfect though it is. Numerous authors have presented convincing evidence that g-factor intelligence, represented by substantial and reproducible intercorrelations among cognitive measures, is not the whole picture. Wechsler's (1991) verbal-performance split, despite its popularity, is actually the least theoretically useful distinction of all, although various post-hoc analyses (limited, of course, to the existing subtests) and some expansion of the subtests on WISC-III have been helpful. Whether one opts for Horn and Cattell's (1982) model of fluid vs. crystallized intelligence, as did Thorndike, Hagen, and Sattler (1986a, b) in creating Stanford-Binet IV, or for Gardner's (1983) theory of multiple intelligences, Sternberg's (e.g., 1982) componential theory, an information-processing approach such as Siegler's (1986), or some other theoretical position, the power of using an intelligence test for differential understanding of a child's pattern of abilities is at this point difficult to deny.

The hierarchical theoretical framework adopted by Thorndike, Hagen, and Sattler from Horn and Cattell (1966, 1982) (with attribution only in the Technical Manual [1986b]) is probably as good as any for the kinds of school-related predictions for which we usually use intelligence tests. The authors retained Spearman's (1904) and Terman and Merrill's orientation to g-factor intelligence (the Composite Score, like an IQ, represents overall performance on the test), but added another level represented by crystallized abilities (further sub-divided into verbal and quantitative reasoning areas), fluid-analytic abilities, and short-term memory, thus creating four domains, each domain further represented by three or four scales. The distinction between crystallized and fluid intelligence reflects concern for the acquired cognitive skills needed to solve new verbal and quantitative problems in school (affected by experiences both in and out of school), and abilities presumably less affected by schooling.

Meager rationale is offered by Thorndike et al. for adopting this particular structure, especially for equating "fluid-analytic" abilities with abstract/visual (spatial) reasoning, and confirmation of the factorial structure of the test seems to have followed, not preceded, its final stage of development. So be it. It still makes sense, and is an improvement over the even more jumbled factorial structure of the Wechslers. Particularly useful are the separation of verbal from quantitative reasoning, and the possibility of looking at quantitative reasoning from several vantage points. Similarly, the definition of a separate short-term memory domain using sequential materials of visual-spatial, numerical, and verbal natures permits a nicely detailed analysis (or omission, if memory is not an issue).

The nature of the tasks themselves is true to the Stanford-Binet tradition of emphasis on abstraction and the solving of novel problems, although the pervasive role of verbal reasoning on the older versions has been brought under control. Most of the item types were derived from Forms L and M. Vocabulary is there, and Picture Absurdities (though, alas, not Verbal Absurdities). In many instances, the child is asked to derive and then apply the rule(s) underlying a set of stimuli - for example, to respond to a series of numbers by stating the next two, to tell how three things are alike but different from a fourth, or to pick the unfolded version of paper which is depicted being folded over and over and then cut. A set of visual matrices is included, not unlike Raven's Progressive Matrices, which has long been considered an excellent exemplar of g.

There are some nice touches to this scale. For example, although it is divided by item types, there is an effort to avoid a long string of uninformative successes and to confine most of the testing effort to the critical region between a basal and a ceiling. Equivalent developmental levels are designated by letter across subtests, and age-level equivalents of raw scores are provided in both the Examiner's Handbook and the Technical Manual. Although entry levels for subsequent subtests are generally identified by a combination of chronological age and ceiling items on Vocabulary, the tester can be guided by an emerging knowledge of the specific child. Following Stanford-Binet tradition, failures below the basal and successes above the ceiling, if they occur, can be taken into account. Finally, the tester has a useful opportunity to watch strategies being invented on the spot - for example, when the child uses scratch paper on the Quantitative Reasoning scales, or struggles with successive short-term memory items. Best of all, the tester has discretion as to which subtests to use. Not only can domains be eliminated if not of interest, but subtests can be eliminated if, for example, they are too easy for a given child.

I have used the scale extensively with gifted children from age 24 months to age 14 years. It works from age 30 months onward, but it is most effective with children of early school age, say, 5 to 11 years. In this range, it has the most "top," is the most interesting to the children, and gives the tester the widest choice of subscales.

For the very youngest children, the test is less intriguing (all the little toys are gone!) and provides too little instruction in test-taking skills; significantly, no children below age 4 were included in the first try-out before standardization, although they were included in the second. With exceedingly bright (but not moderately bright) children above age 11 or 12, many of the most difficult items can prove too easy, although one can certainly get a sense of the power of such a child's reasoning, and usually some differential patterns of ability.

The new test has some definite drawbacks, to be sure. It unmistakably reflects a psychometric emphasis rather than a developmental one. There are few qualitative as opposed to quantitative developmental progressions (such as the familiar progression from Differences to Similarities to Similarities-and-Differences to Essential-Similarities); age spans by which scores are read from tables were kept far too broad (as wide as a year for older children) in order to base them on the large numbers of cases psychometricians trust; some items are out of order for bright children with limited experience but superior reasoning skills; within domains, intercorrelations of subscales tend to be only moderate and irregular (e.g., Absurdities, called a verbal reasoning subscale, has a heavy visual component); scoring of items such as Vocabulary, Comprehension, Copying (drawing), and Memory for Sentences, unlike the Wechslers, is all-or-none; and, because of the multiple-choice format of several subscales, giving the test to a thoughtful, reflective older child can be a colossal bore for the examiner. Furthermore, for gifted populations, too few scales proceed to the most difficult levels, so that ceilings are too easily reached, although careful choice of subtests can circumvent some of this problem.

In sum, it is the more recent norms and the factorial structure of the test that, to me, make Stanford-Binet IV the choice over its predecessor. Even if the 1972 Stanford-Binet were to be renormed, its restrictive emphasis on g-factor intelligence and verbal reasoning, its uneven content from one age to another, and the lack of control it gives the examiner, make it a relic - a beloved relic, to be sure. There is plenty of work to be done to make Stanford-Binet V a better test, but Stanford-Binet IV is a decent beginning. It is, I believe, for bright children, the test of choice over a wide age range, in comparison with its recently standardized Wechsler competitors, even though the popularity of the Wechslers often dictates their use. But I will always have a very fond place in my heart for Form L (for Lewis) and Form M (for Maud) and for Form L-M. The world won't be the same without them.


Flynn, J.R. (1987). Massive IQ gains in 14 nations: What IQ tests really measure. Psychological Bulletin, 101, 171-191.

Gardner, H. (1983). Frames of mind: The theory of multiple intelligences. New York: Basic Books.

Horn, J.L., & Cattell, R.B. (1966). Refinement and test of the theory of fluid and crystallized intelligence. Journal of Educational Psychology, 57, 253-276.

Horn, J.L., & Cattell, R.B. (1982). Whimsy and misunderstandings of Gf-Gc theory. Psychological Bulletin, 91, 623-633.

Siegler, R.S. (1986). Children's thinking. Englewood Cliffs, NJ: Prentice-Hall.

Spearman, C.E. (1904). "General Intelligence" objectively determined and measured. American Journal of Psychology, 15, 201-292.

Sternberg, R.J. (1982). A componential approach to intellectual development. In R.J. Sternberg (Ed.), Advances in the psychology of human intelligence, Vol. I (pp. 413-463). Hillsdale, NJ: Erlbaum.

Terman, L.M., & Merrill, M.A. (1972). Stanford-Binet Intelligence Scale, 1972 norms edition. Boston: Houghton Mifflin.

Thorndike, R.L., Hagen, E.P., & Sattler, J.M. (1986a). The Stanford-Binet Intelligence Scale, Fourth Edition: Guide for administering and scoring. Chicago: Riverside.

Thorndike, R.L., Hagen, E.P., & Sattler, J.M. (1986b). The Stanford-Binet Intelligence Scale, Fourth Edition: Technical Manual. Chicago: Riverside.

Wechsler, D. (1991). Wechsler Intelligence Scale for Children, 3rd edition Manual. San Antonio, TX: Psychological Corporation Harcourt Brace Jovanovich.
