Issues Regarding the Reliability, Validity and Utility of
It has been suggested that students' evaluations of their instructors began at the universities of medieval Europe (Centra, 1993). In the modern era, there has been a tremendous increase in interest regarding student ratings of instruction, and this topic has been the subject of a substantial body of research spanning approximately 70 years (Arreola, 1995). (Currently, an ERIC search produces over 1300 records related to the topic of student evaluations.) Student ratings were first used in North American universities in the mid-1920s (d'Apollonia and Abrami, 1997). According to Greenwald (1997), "the validity of student rating measures of instructional quality was severely questioned in the 1970s. By the early 1980s, however, most expert opinion viewed student rating measures as valid and as worthy of widespread use" (p. 1182). In his review of the historical trends in research on student ratings, Greenwald notes that over the 25-year period from 1971 to 1995, the majority of publications supported the validity of student ratings (Greenwald, 1997). The recent literature on student ratings (in addition to significant reviews published after 1980) suggests that many of the questions explored in the 1970s regarding the validity of student ratings have been sufficiently answered as a result of subsequent research (Greenwald, 1997). Marsh (1984) comments that:
Probably students' evaluations of teaching effectiveness are the most thoroughly studied of all forms of personnel evaluation, and one of the best in terms of being supported by empirical research...(Marsh, 1984, p. 749).
The goal of this review was to assess what is known regarding various issues that have been raised about student ratings of instruction, as identified by the APC Implementation Task Force on Students Ratings of Instruction (hereafter, the "Task Force"). The initial identification of the research issues addressed was based on a review by Arreola (1995), and was supplemented by additional issues identified by the Task Force. The task was initiated by an analysis of two reviews on frequently voiced concerns about student ratings (Aleamoni, 1987; Arreola, 1995), as well as five recent articles on student ratings in the November, 1997 issue of American Psychologist (Volume 52, number 11). Many of the sources cited in these various articles, as well as the significant amount of research archived in the Teaching Development Office and the Office of the Vice-President (Academic) of the Students' Union, were also considered. To provide information on student ratings specific to the University of Calgary context, the two-phase psychometric evaluation of the proposed University of Calgary Instrument for the Task Force by Creating Organizational Excellence (COE) was also reviewed. The COE research (Phase I, COE, 1997; Phase II, Volumes 1 and 2 - Kline, T. and Rever-Moriyama, S., 1998) was carried out in response to the mandate of the General Faculties Council (GFC) to the Task Force for a full-scale psychometric evaluation of the proposed ratings instrument as part of its approval process (GFC meeting # 420, 1997-02-20). The following literature discussion is not intended as an exhaustive review, but as a series of task-relevant "snapshots" of current research-based conclusions in this field.
Unfortunately, despite the importance of this research for decisions regarding the use of student ratings of instruction, it is often unfamiliar to most in the university community. As Centra (1993) notes, citing a 1989 study by Franklin and Theall, this has had a significant impact on attitudes toward student evaluations. This study tested 670 college faculty members and administrators at three colleges on their knowledge and their opinions regarding student evaluations. The scores were similar across all three institutions and indicated a lack of knowledge with respect to the research on student evaluations. Furthermore, the study showed that knowledge and attitudes were intertwined. That is, "respondents who knew more about the research also had more favorable attitudes about the use and potential of student evaluations" (Centra, 1993, p. 48).
The research suggests that concerns regarding issues of reliability and validity are germane for rating forms that have not gone through the rigorous psychometric testing necessary to produce professional rating forms. This would apply to instruments "patched together" by the administration, faculty or students and which are, in essence, "home-made" (Arreola, 1995). However, according to a review of the literature conducted by Aleamoni (1987) and Arreola (1995), well-developed, tested, student rating forms are both reliable and valid.
Reliability indicates how consistently a set of items measures a particular construct or set of constructs. This can refer to consistency across raters (e.g. all students rate a professor as a "4"), termed inter-rater reliability; across time (e.g. a professor receives the same ratings every semester), termed test-retest reliability; or across items (e.g. a professor is consistently rated highly on all the "organization" items), termed internal consistency. In short, reliability provides information on the extent to which a given measurement will give similar information in different contexts or times of measurement. It should be noted that reliability of one type does not necessarily mean that reliability of all types is high, and also that reliability is a necessary pre-condition for validity.
Inter-rater reliability, or "interrater agreement," is a key indicator of the reliability of student rating forms. Marsh and Roche (1997) state that the reliability of student rating forms "is most appropriately determined from studies of interrater agreement that assess agreement among different students within the same course" (p. 1188). Marsh (1987) found that while correlations indicative of reliability between two raters were low (.20s), the reliability of the class average response was high. He also noted that the reliability of the class average response depends on the number of students in a class. Reliability correlations (where 1.0 indicates a perfect correlational relationship) were: .95 for 50 students, .90 for 25 students, .75 for 10 students, and .60 for 5 students (Marsh, 1987).
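The dependence of class-average reliability on class size can be illustrated with the Spearman-Brown formula, a standard psychometric result for the reliability of an average of several raters. The sketch below is only an illustration: the source does not state which method Marsh (1987) used, the function name is hypothetical, and the single-rater value of 0.23 is an assumed figure chosen from the ".20s" range reported above.

```python
def spearman_brown(single_rater_r: float, n_raters: int) -> float:
    """Estimated reliability of the mean of n_raters ratings.

    single_rater_r is the correlation between two individual raters
    (the single-rater reliability); n_raters is the class size.
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)


# ASSUMPTION: 0.23 is an illustrative single-rater correlation in the
# ".20s" range; it is not a figure taken from Marsh (1987).
ASSUMED_SINGLE_RATER_R = 0.23

for class_size in (5, 10, 25, 50):
    estimate = spearman_brown(ASSUMED_SINGLE_RATER_R, class_size)
    print(f"{class_size:>2} students: estimated reliability {estimate:.2f}")
```

Under this assumption the estimates rise steeply with class size and fall close to the values reported above, which shows how low agreement between any two students can coexist with a highly reliable class average.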
The findings of other researchers (e.g. Costin, Greenough and Menges, 1971; Marsh, 1984) also support the reliability of student rating forms, reporting reliabilities for professionally constructed forms to be approximately 0.90. Aleamoni (1987) cites several rating forms with a reliability of 0.90 and above. These include the Educational Testing Service (ETS) instrument, Student Instructional Report, Instructional Development and Effectiveness Assessment and the Arizona Course/Instructor Evaluation Questionnaire. Centra (1993) maintains that reliability estimates for student ratings of instruction are "good" (p. 58). Finally, Costin, Greenough and Menges (1971) comment:
It would appear, then, that students can rate classroom instruction with a reasonable degree of reliability. In particular, the evidence cited concerning the stability of students' ratings argues against the connection (sometimes made by opponents of student ratings) that opinions of instruction are difficult to interpret, since they might be made after a particularly good or bad atypical experience (p. 513).
Most researchers agree that validity (the degree to which a test actually measures what it is supposed to measure) is more difficult to determine than reliability. Nonetheless, numerous researchers (Abrami, d'Apollonia and Cohen, 1990; Cohen, 1981; Feldman, 1989; Marsh, 1987) conclude that student ratings of instruction are, indeed, valid.
Much of the evidence that supports the validity of student rating forms arises from studies in which student ratings are "correlated with other indicators of teacher competence" (Arreola, 1995). For example, student ratings are often correlated with colleague ratings, trained observers' ratings, alumni ratings or measures of student learning. Aleamoni and Hexner (1980) cited 14 studies in which student ratings were compared to the above indicators. Moderate to high positive correlations were found which, in turn, support the validity of student ratings of instruction. Similarly, Murray (1984) summarized several general reviews (Marsh, 1983 cited in Murray, 1984; McKeachie, 1979; Murray, 1980 cited in Murray, 1984) and concluded that:
Student ratings of classroom teaching correlate moderately to highly (0.50 to 0.90) with comparable ratings made by supervisors, colleagues, alumni, and paid classroom observers, indicating that student perceptions of good and poor teaching are similar to those of more expert, more mature, and more neutral observers (p. 119).
Let us now examine three of the above-mentioned indicators of teaching competence - student learning, trained observers, and colleague ratings - more closely.
d'Apollonia and Abrami (1997) looked closely at the student learning criterion for the validity of student ratings. This is based on the assumption that the students of more highly rated professors should be learning more.
One method to test this hypothesis is the multi-section validity design, which examines multiple sections of the same course taught by different instructors but using a common exam. Consistent with this hypothesis, the section average score on student ratings is correlated with the section average score on the common exam (d'Apollonia and Abrami, 1997). Six published meta-analyses of multi-section validity studies that gave conflicting conclusions were compared by Abrami, Cohen and d'Apollonia (1988 cited in d'Apollonia and Abrami, 1997). Cohen (1981 cited in d'Apollonia and Abrami) "concluded that student ratings were valid measures of teaching effectiveness, whereas Dowell and Neal [1982 cited in d'Apollonia and Abrami] and McCallum [1984 cited in d'Apollonia and Abrami] concluded that ratings, at best, were poor measures of teaching effectiveness" (p. 1201). The meta-analyses compared were Abrami (1984), Cohen (1981, 1982, 1983), Dowell and Neal (1982) and McCallum (1984), all cited in d'Apollonia and Abrami (1997). In their review of these meta-analyses, however, d'Apollonia and Abrami (1997) "identified a number of decisions made by the reviewers that biased their findings" (p. 1201). These included using different analytical techniques and varying inclusion criteria (p. 1201). Subsequently, d'Apollonia and Abrami conducted their own meta-analysis in 1996, correcting for these problems, and concluded that "there was a moderate to large association between student ratings and student learning, indicating that student ratings of General Instructional Skill are valid measures of instructor-mediated learning in students" (1997, p. 1202).
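The core computation in a multi-section validity design can be sketched as follows. The key point is that the unit of analysis is the section, not the individual student: ratings and exam scores are first averaged within each section, and the correlation is then taken across sections. This is a minimal sketch under stated assumptions; the section labels, ratings and exam scores are invented purely for illustration and do not come from any study cited here, and the helper name is hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# HYPOTHETICAL DATA: each section maps to (student ratings on a 1-5 scale,
# student scores on the common exam). All values are made up.
sections = {
    "Section A": ([4, 5, 4, 5], [78, 85, 80, 88]),
    "Section B": ([3, 3, 4, 2], [70, 74, 72, 69]),
    "Section C": ([5, 4, 5, 5], [82, 79, 90, 86]),
    "Section D": ([2, 3, 3, 2], [71, 65, 73, 68]),
}

# Step 1: average within each section.
rating_means = [mean(ratings) for ratings, _ in sections.values()]
exam_means = [mean(scores) for _, scores in sections.values()]

# Step 2: correlate the section averages across sections.
r = pearson_r(rating_means, exam_means)
print(f"Correlation of section-average rating with section-average exam score: {r:.2f}")
```

Because every section sits the same exam, differences in section-average exam scores can be more plausibly attributed to the instructor than in designs where each instructor sets a different test.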
Murray (1983) conducted a study in which he trained observers to evaluate particular teaching behaviours of 54 university professors with previous student ratings ranging from high to low in other classes. Each professor was carefully watched for 18 to 24 hours in total throughout the semester. The results indicated that student ratings and trained observers' ratings were comparable. As reported by the observers, there were differences in teaching behaviours between those professors with high ratings and those with average or low ratings. For example, professors with high ratings repeated difficult concepts to aid in clarity, spoke more expressively (showed enthusiasm) and displayed sensitivity to student needs. In considering Murray's study, Centra (1993) notes that, "...in short, student evaluations seemed to be determined by actual classroom behaviours rather than by pleasing personalities or other invalid indicators" (p. 63).
Although it has been suggested by some that students are not able to make consistent judgments about the instructor and instruction, the evidence available suggests that student ratings are as good as or superior to peer evaluation by colleagues.
Arreola (1995) points out that the stability of student ratings from year to year is very high with correlations in the range of 0.87 to 0.89. Similarly, research cited by Costin, Greenough and Menges (1971), and studies completed by Gillmore (1973 cited in Arreola, 1995) and Hogan (1973) report correlations from 0.70 to 0.87 for "student ratings of the same instructors and courses" (Arreola, 1995, p. 83). Aleamoni (1987) also concludes that the literature suggests that students' judgments are fairly stable and consistent.
As previously noted, several researchers (Aleamoni and Hexner, 1980; Aleamoni, 1987; Murray, 1984; Arreola, 1995), either in their reviews of the literature or in the conduct of their own studies, have concluded that student ratings of instruction correlate well with colleague ratings. However, there have been some studies (see Centra, 1975) in which the correlation between student ratings and peer ratings based on classroom visitation was low (r=0.20). In these studies, however, the agreement among the peer observers was also low (r=0.26). Given these results, one researcher suggested that this "brings into question their [peer ratings] value as a criterion of effective teaching and precluded any good correspondence with student ratings" (Marsh, 1987, p. 294). In a report for the Ontario Confederation of University Faculty Associations, Murray argued that in comparison to student ratings, colleague ratings can be: "(1) less sensitive, reliable, and valid; (2) more threatening and disruptive of faculty morale; [and] (3) more affected by non-instructional factors such as research productivity" (p. 45, cited in Galbraith, 1997).
Recent research addressing the validity of student ratings of instruction was featured in the November, 1997 issue of American Psychologist (volume 52, number 11). In the Current Issues section, which focused on student evaluations of instructors, all 7 authors (Greenwald, 1997; Marsh and Roche, 1997; d'Apollonia and Abrami, 1997; Greenwald and Gillmore, 1997; McKeachie, 1997) support the validity of student ratings. McKeachie (1997), charged with commenting on all of the other articles in the section, reports that all of the authors "agree that student ratings are the single most valid source of data on teaching effectiveness" (p. 1219) and, to sum up Marsh and Roche's (1997) comments, adds that "there is little evidence of the validity of any other sources of data" (p. 1219).
Demming (1972) notes that a "prime requirement" for an instructor is possessing "some knowledge to teach" and, to him, the way to acquire such knowledge was to conduct research that was "worthy of publication" (p. 47). Therefore, in his view, only faculty with good publication records and experience are qualified to evaluate their peers' instruction; students, lacking such a record of scholarship and the knowledge it produces, are not competent to evaluate their professors. Arreola (1995), however, points out that the impact of an active research program on teaching is not a robust one. He cites studies (Maslow and Zimmerman, 1956; McDaniel and Feldhusen, 1970; and Stallings and Singhal, 1968, all cited in Arreola, 1995) that reported only weak positive correlations between "research productivity and teaching effectiveness" (p. 83). Numerous other studies (Aleamoni and Yimer, 1973; Costin et al., 1971; Linsky and Straus, 1975; Hayes, 1971 and Voeks, 1962 both cited in Arreola, 1995) identified no significant correlations between research productivity and student ratings of teacher effectiveness (Arreola, 1995). For example, Aleamoni and Yimer's (1973) research reported a correlation of 0.07 between peers' ratings of instructional effectiveness and research productivity; between student ratings and research productivity the correlation was -0.04. As Arreola (1995) argues, there is no "clear, consistent evidence" in the research that implies that only excellent researchers can make excellent instructors and that they "are the only people qualified to evaluate teaching" (p. 84).
Since student ratings are usually carried out anonymously, it is highly problematic to compare ratings of a given student or a group of students years after they have graduated. Therefore, most of the research in this area looks at the relationship between ratings by alumni, or graduating seniors, and those made by current students. Research in this area was conducted by Drucker and Remmers at Purdue University early in 1950 and 1951 (cited in Arreola, 1995; Aleamoni, 1987). High positive correlations were found between ratings of graduates of 5 and 10 years and currently enrolled students. Similar studies were conducted at the University of Illinois (Aleamoni and Yimer, 1974 cited in Aleamoni, 1987), and at the University of California, Los Angeles (Marsh, 1977) and produced similar results (Aleamoni, 1987), as did studies by Marsh and Overall (1979) and McKeachie, Lin and Mendelson (1978). In general then, the evidence suggests considerable stability in the ratings of courses and instructors; those rated most highly by current students are also likely to be highly regarded when considered retrospectively.
Many faculty believe this statement to be true, and accordingly, much has been written about this issue. In what have come to be termed the "Dr. Fox" studies, there have been results that suggest that instructors who are enthusiastic and expressive will receive good student ratings regardless of the content they deliver in their classes. In the original Dr. Fox study, a professional actor gave a lecture to educators and graduate students in a dynamic and enthusiastic manner, but devoid of meaningful content. Nevertheless, he received favourable ratings (Naftulin et al., 1973 cited in Centra, 1993). This original study, however, has been soundly criticized for methodological weaknesses (see Marsh and Dunkin, 1997).
In a reevaluation of subsequent "Dr. Fox" studies, Marsh and Ware (1982) discovered that when students are given an incentive to learn, (i.e. students know that they will be tested on the material), a situation much closer to the real university setting, the "Dr. Fox" effect did not occur. In other words, the instructor who is expressive, yet does not deliver the appropriate content, is rated highly only in those categories directly related to enthusiasm (i.e. "Instructor Enthusiasm") and receives appropriately lower scores in categories such as "Instructor Knowledge" and "Organization and Clarity" (Marsh and Roche, 1997). In a review of the "Dr. Fox" research, Abrami, Leventhal and Perry (1982) comment that much of this research has been fraught with inconsistencies in findings from various studies, which, in turn, has led to disagreement among reviewers.
Costin et al. (1971) reviewed Guthrie's (1954 cited in Costin et al., 1971) study, in which he found that instructors who were highly rated were considered to be "substance teachers" and not simply "entertainers" (p. 518). Furthermore, in Murray's 1983 study, in which he employed neutral observers, he concluded that student ratings seemed "determined more by the actual classroom behaviours of the instructor than by extraneous factors such as 'personality' or 'popularity'" (p. 146). Murray also reasons that "expressive teaching behaviors serve to communicate the lecturer's enthusiasm for the subject matter, and thereby elicit and maintain student attention to lecture material" (p. 147). This, in turn, assists students in remembering the material they have learned and, consequently and appropriately, also affects the ratings students give their instructor (Murray, 1983).
In other research, Grush and Costin (1975) found that the correlation between the personal attraction students held for their instructors and how highly they rated those instructors (i.e. "teacher skill") was low. Aleamoni (1976 cited in Aleamoni, 1987) reviewed thousands of written comments by students and discovered that while they praised instructors for their humour and enthusiasm, if their courses were not well-organized, for example, the students also criticized their professors on this point. As Aleamoni (1987) says:
...the students are not easily fooled. In rating their instructors, students discriminate among various aspects of teaching ability: If a teacher tells great jokes and has the students in the palm of his or her hand in the classroom, he or she will receive high ratings in humor and classroom manner, but these ratings do not influence students' assessments of other teaching skills (p. 27).
In addition, research by Costin, Greenough and Menges (1971), Frey (1978), and Arreola (1983 cited in Arreola, 1995) identifies students "as discriminating judges of instructional effectiveness" (Arreola, 1995, p. 84). Centra (1993) summarizes the view of many researchers when he comments:
Do these findings indicate that student ratings are unduly affected by expressive instructors? Probably not. First, Abrami, Leventhal, and Perry (1982) mention that, a twenty- to thirty-minute videotaped lecture represents only a minuscule percentage of actual lecture time in a three-credit course. Second, such extreme manipulativeness is unlikely in real-life teaching situations. Few college teachers provide no content in their courses and instead substitute enthusiasm. For these reasons, generalizations from the laboratory experiments to actual classroom teaching are tenuous. But if we were to generalize, a reasonable lesson from seduction research would be that by teaching more enthusiastically, teachers will receive high ratings and their students will retain more of the course content (p. 77).
A recent study by Williams and Ceci (1997), however, found that instructor enthusiasm had a strong biasing effect on student ratings. One of the authors (Ceci) took a faculty development seminar to improve his "presentational style" since he had consistently received average ratings. When he taught the course again, he made the same main points and used the same text, syllabus and overheads, but changed the level of his enthusiasm and used the presentational techniques (voice inflection, gesturing) which he had learned in the seminar. The end result was that his student ratings were significantly higher than his previous ratings. Furthermore, his ratings improved in areas not "directly related" to instructor enthusiasm (i.e. "knowledge," "accessibility outside of class"). In addition, there was no positive correlation (as might be expected) between instructor enthusiasm and student learning. The students in his more "enthusiastic" class did not do better on tests than his previous students.
Although this study shows an "enthusiasm effect" that appears to question the validity of the ratings, a number of points should be noted about the study. For example, although students did not perform any better on the exams in the more enthusiastic condition, it is not clear how much lecture material was even on the exams. If the exams were primarily on the text, enthusiasm should not be expected to exert much effect on performance. Also, the effect was evaluated only with one class, rather than several and no attempt was made to reverse the effect with a third class. Thus, it is not clear that the results observed did not simply reflect that the two classes were composed of different people at a different time.
None of the "In Response" writers who were included with the article seemed to accept that the implications of the Williams and Ceci article were as serious as did the authors. In fact, d'Apollonia and Abrami (1997b) severely criticize the study, noting that Williams and Ceci's literature review is "selective, biased, and erroneous" and that the research itself has a number of serious "methodological flaws" (p. 18). d'Apollonia and Abrami challenge the claim made by Williams and Ceci that student ratings are invalid and biased and suggest that their study has little or no value. While this study may suggest some future research to define appropriate limits on the use of student rating data, they view the study as so poorly done that it offers no basis for strong conclusions. Furthermore, as Brown (1998) reminds us, this is only one study in comparison to numerous others which offer opposite results. There are certainly numerous factors that can affect students' performances on exams. Brown does point out that enthusiasm alone will not help an instructor with serious "flaws":
What the study [Williams and Ceci, 1997] shows, at a minimum, is that a well structured course with a well chosen text book and clear syllabus can be considerably down-graded by students if the instructor lacks enthusiasm. It does not show that a poor instructor can get better ratings on a flawed course simply by being more enthusiastic (p. 6).
The advantages of an expressive and enthusiastic instructor for student outcomes beyond test performance should also not be overlooked. These include such variables as class attendance, selection of courses and majors, and perceived approachability of the instructor. For example, Phillips (1998) conducted a study at York University in which he collected student opinion regarding student evaluations of teaching. He commented that:
Students admitted that personality did enter into their assessment and that they would most likely rate the charismatic lecturer more highly. However, they insisted that this was relevant to the question of the effectiveness of the pedagogy. To quote one student "if I am bored I learn less...if I am constantly engaged by the teacher I learn more" (p. 9).
In conclusion then, with the exception of this recent study by Williams and Ceci (1997), the belief that student ratings are based on popularity or personality variables has not been substantiated by the literature.
The "expected grades/grading leniency" concern is perhaps the most controversial and, according to Arreola (1995), the most researched, of the potential biases to student ratings. Murray (1996), however, points out, to the degree that higher grades reflect greater learning, a positive relationship between grades and ratings is appropriate:
...the average correlation of 0.28 found between grades and ratings may reflect a tendency for highly rated teachers to foster high levels of learning in their students, which in turn results in justifiably higher student grades. In other words, the positive correlation between grades and ratings may be a valid reflection of differential teacher effectiveness rather than an impetus for grade inflation (p. 18).
Marsh and Roche (1997) also point out that support for the grading leniency effect is "weak" and that "the size of such an effect is likely to be unsubstantial" (p. 1192). Similar to Murray, they also note that:
Class-average grades are correlated with class-average students' evaluations of teaching, but the interpretation depends on whether higher grades represent grading leniency, superior learning, or preexisting differences (p. 1194).
Greenwald and Gillmore (1997) posit five theories intended to explain the grades-ratings correlation. These theories are: (1) teaching effectiveness influences both grades and ratings; (2) students' general academic motivation influences both grades and ratings; (3) students' course-specific motivation influences both grades and ratings; (4) students infer course quality and own ability from received grades; and (5) students give high ratings in appreciation for lenient grading (see a discussion of these theories in Greenwald and Gillmore, pp. 1210-11).
The first and fifth of these possibilities are the most directly contradictory. The first theory holds that in courses taught by good instructors, students learn a lot, deserve high grades, and as a result of their learning, give appropriately high ratings to their instructors. Therefore, "instructional quality" adequately explains the grades-ratings correlation (Greenwald and Gillmore). However, Greenwald and Gillmore argue for the fifth theory that undeserved grades produce undeserved high ratings. These researchers point out that this theory was supported by critics of student evaluations in the 1970s. However, support for the "leniency" theory dropped sharply due to "correlational construct-validity research conducted in the late 1970s and early 1980s" (p. 1211).
Studies examining construct validity attempt to answer the question "do student ratings measure the construct (i.e. teaching effectiveness) they are supposed to measure?" (See the section III.1. on validity for a review of these studies.) Construct validity then, is a measure of whether, and the extent to which, a given survey (or other measure) captures the concept it was designed to assess.
Greenwald and Gillmore (1997) conclude that the results of their study showed that grading leniency does influence student ratings to a degree sufficient to warrant a statistical correction in order to "remove the unwanted inflation of ratings produced by lenient grading" (p. 1209). Their claims were reiterated (along with those of Williams and Ceci regarding the effects of instructor enthusiasm), in a recent article of The Chronicle of Higher Education authored by Wilson (1998).
Greenwald and Gillmore's colleagues and fellow writers of the Current Issues section on student evaluations (see American Psychologist, November, 1997) ardently disagree with their conclusions. According to Brown (1998), the major criticisms of Greenwald and Gillmore's study are two-fold. One suggests that there may be other possible explanations than the five theories they discuss and they have not dismissed these other explanations. The other criticism has been that what these authors have studied may not be lenient grading at all, but rather, just high marks. Brown (1998) suggests two points to consider:
"[I]t's possible that [Greenwald and Gillmore's] "lenient graders" are really just more effective teachers who deserve the higher evaluations and whose students earn higher grades. The other suggestion has been that it's not clear that even lenient grading falls outside the circle of teaching effectiveness: to the extent that getting higher grades is motivating to students, a tendency to assign them may in fact be relevant to teaching effectiveness" (p. 5).
McKeachie (1997) also finds Greenwald and Gillmore's argument "flawed" on a number of counts (see detailed discussion by McKeachie, 1997). He agrees with Greenwald and Gillmore that giving higher than deserved grades may result in receiving higher than deserved ratings, but only if the students are led to believe that they are learning more than "is typical." But McKeachie argues that "students are not so likely to be positively affected if an ineffective teacher seems to be trying to buy good ratings with easy grades" (p. 1220) and cites evidence that this tactic may, in fact, "boomerang." (See research by Abrami, Dickens, Perry, and Leventhal (1980), cited in McKeachie, 1997, that demonstrated the negative effect of lenient grading practices.) Marsh and Roche (1997) summarize their review of the literature on the "expected grades/grading leniency" concern and conclude that: "whereas a grading-leniency effect may produce some bias in SETs [students' evaluations of teaching], support for this suggestion is weak, and the size of such an effect is likely to be unsubstantial" (p. 1192).
The psychometric analysis of the proposed Universal Student Ratings of Instruction Instrument, conducted by Creating Organizational Excellence (COE), Phase II, vol. 1, indicated no significant correlations between expected grade and student ratings of instruction. (See further reports of the findings of the psychometric analysis below.)
A number of variables not directly relevant to academic performance have been suggested to affect student ratings of instruction. They include: size of the class, gender of the instructor and student, level of course (1st year as opposed to 4th year), rank of the instructor (instructor, assistant professor, associate professor), student workload, and the value-system or ideology of the instructor. What evidence exists for the importance of any of these variables? (Note: The findings of the psychometric analysis of the proposed Universal Student Ratings of Instruction Instrument, consisting of 12 items, will be noted where applicable. All citations in this section refer to the COE Phase II, Vol. 1 report [T. Kline and S. Rever-Moriyama, 1998].)
Many faculty believe that instructors who teach smaller classes are rated more highly than instructors who teach larger classes since smaller classes allow for more instructor-student contact. Aleamoni's (1987) review of the research (see Aleamoni and Hexner, 1980), however, did not yield significant relationships between class size and student ratings. Aleamoni and Hexner did cite older studies that showed a correlation between ratings and class size, but they also cited several studies that gave the opposite conclusions. Arreola (1995) describes the findings of some studies which reported a curvilinear relationship between student ratings and class size. That is, small classes (approximately under 30 students) and very large classes (approximately 120 students or more) are rated more favourably than those classes in the mid-range (e.g. Kohlan, 1973; Linsky & Straus, 1974; Marsh, Overall, & Kesler, 1979; Pohlmann, 1975 all cited in Arreola, 1995).
In Marsh's (1987) comprehensive review of the research pertaining to student evaluations, in addition to his own study, he concludes that class size is not a bias to student ratings. Rather, class size has a "moderate" effect on particular aspects of "effective teaching (primarily Group Interaction and Individual Rapport) and these effects are accurately reflected in the student ratings" (p. 314). Marsh points out that the class size discussion serves to emphasize the multidimensionality of student evaluations; student evaluations cannot be comprehended fully without understanding their multidimensional nature (Marsh, 1987). In Marsh and Roche's 1997 overview of the relationships between a number of extraneous variables and student ratings, they state that there are "mixed findings" in relation to class size, "but most studies show smaller classes are rated somewhat more favorably, although some find curvilinear relationships where large classes also are rated favorably" (p. 1194). McKeachie's comments on the class size issue are also worthy of note:
The concern about class size seems to me to be valid only if a personnel committee makes the mistake of using ratings to compare teachers rather than as a measure of teaching effectiveness. There is ample evidence that most teachers teach better in small classes. Teachers of small classes require more papers, encourage more discussion, and are more likely to use essay questions on examinations--all of which are likely to contribute to student learning and thinking. Thus, on average, small classes should be rated higher than large classes (p. 1220).
The results of the psychometric analysis of the Universal Student Ratings of Instruction Instrument suggested that class size had "little impact" on student ratings (COE report, Phase II, vol. 1, p. 17). The study noted that class size did affect three specific items on the evaluation form inasmuch as smaller classes "were more likely to rate the following items more positively": 1) item 5, "Student input was treated appropriately"; 2) item 7, "Opportunities for course assistance were available"; and 3) item 8, "Students were treated respectfully" (p. 17). However, class size did not affect the other nine of twelve items.
According to Arreola (1995), results in the literature regarding gender and ratings are inconsistent. Aleamoni and Hexner (1980) found no significant relationship between ratings and gender (of the instructor or student). Other researchers (Doyle and Whitely, 1974; Isaacson, McKeachie et al., 1964 cited in Arreola, 1995) support this conclusion. In Costin et al.'s (1971) review of the research, they also cite seven studies that confirm the absence of significant differences between the ratings made by male or female students, and the ratings received by male and female instructors.
In contrast, both Costin and associates and Aleamoni and Hexner also cite a study by Bendig (1952 cited in Costin et al., 1971) showing that female students tended to be slightly more critical of their male instructors than were their fellow male students, and another study by Walker (1969 cited in Costin et al., 1971) that found that female students rated female instructors "significantly higher" than they rated male instructors (p. 520). Furthermore, investigations in the 1970s (Ashton, 1975; Kohlan, 1973; McKeachie et al., 1971; Pohlmann, 1975 all cited in Aleamoni and Hexner, 1980) indicated that female students rated instructors more highly in various areas than did male students in the same class. Arreola (1995) observes that there is no consistent view regarding the relationship between gender and student ratings of instruction and Marsh and Roche (1997) conclude that the gender issue has "mixed findings but little or no effect" (p. 1194). These latter conclusions are consistent with COE's psychometric analysis of the Instrument which reported no significant correlations between gender of either the instructor or the student and student ratings (Kline and Rever-Moriyama, 1998).
More studies than not are consistent with the belief that the level of the course exerts some effect on student ratings. Aleamoni and Hexner (1980) mention 8 researchers who found no meaningful relationship between the level of the course and student ratings. Conversely, they cite 18 other investigators who concluded that higher level students (e.g. graduate students, 4th year students) tend to give higher ratings to instructors than more junior level students (e.g. 1st year, 2nd year) (see Aleamoni and Hexner, 1980). Marsh and Roche (1997) state that "graduate-level courses are rated somewhat more favourably [and that] weak, inconsistent findings suggest upper division courses are rated higher than lower division courses" (p. 1194). This is probably not surprising in that students in smaller higher level courses are likely to be more dedicated and knowledgeable about the area of instruction and to receive more personal interactive forms of instruction. Aleamoni (1987) concludes that the level of the course should be considered when reviewing student evaluations.
Again, psychometric analysis using the proposed University of Calgary Universal Student Ratings of Instruction Instrument revealed "no differences between classes of different type [or] level...on any of the 12 items" (Kline and Rever-Moriyama, 1998, p. 17).
Rank of the instructor appears to have little consistent effect on student ratings. Arreola (1995) cites 5 studies that show that higher ranked instructors received higher ratings and 5 studies that report no meaningful correlation between rank of the instructor and student ratings (see Arreola, 1995). Similarly, Aleamoni (1987) comments that there are some studies that report correlations between instructor rank and student ratings, but both researchers agree that no consistent pattern has appeared in the literature. Again, Marsh and Roche (1997) state that there have been "mixed findings but little or no effect" (p. 1194).
There is little direct evidence regarding the effect of the instructor's value-system or ideology on student ratings. Two studies examined teachers' social-political attitudes and ideologies (Bausell & Magoon, 1972 cited in Feldman, 1987; Wilson et al., 1975) and found no relationship between these characteristics and ratings of teaching effectiveness.
In addition to these specific studies, numerous studies have examined professorial personality and attitudes and how these relate to student ratings. In a comprehensive review of the relation between professor personality and attitudes, Feldman (1986) found that professors' perceptions of their own personality were not related to student ratings, but student perceptions of professorial personality were. Interestingly, colleague ratings of personality were more strongly related to student ratings of effectiveness than to self-reported personality. Erdle, Murray and Rushton (1985) provided preliminary evidence that the relationship between personality and student ratings may be mediated by classroom behaviors.
Related to the personality research are studies examining attitude similarity. In a study comparing course evaluations with differences between perceived professor characteristics, and current and ideal self, Thomas, Ribich and Freie (1982) found that, as predicted, students whose current and ideal selves were closer to their perceptions of the professor also rated the course and professor more highly. Relationships between the ideal self and professor were stronger than those between current self and professor. In a similar study, Abrami and Mizener (1985) had students rate their own attitudes and perceived professor attitudes on a variety of topics. They found that although there was a significant relationship between perceived similarity and both ratings of effectiveness and course grade, these relationships all but disappeared when professor effects were controlled for. In other words, perceived similarity, course grades and student ratings were all predicted most efficiently by who the instructor was. The authors conclude that "the validity of student ratings is not substantially affected by student/instructor attitude similarity" (p. 701). Other research using similar methodology also supports this conclusion (Tollefson, Chen & Kleinsasser, 1989). Concerning attitudes, Feldman (1987) reports that professors who are perceived as more committed to undergraduate teaching and more student-centered in their approach tend to receive higher ratings.
One study examined the effects of revealing sexual orientation (Riddle, 1997). The author taught four sections of the same course and revealed her lesbian orientation to two sections and not the others. She found no differences in either mean student ratings or class variance. Midterm evaluation (prior to her "coming out") was used as a covariate.
Locally, the concerns over personal attitudes and political views appear to be slight at best. In the survey of University of Calgary faculty conducted as part of the Phase I psychometric evaluation of the Instrument (COE, 1997), this issue was not raised by any of the 83 faculty participants.
In conclusion then, research from a variety of sources and examining a variety of attitudes suggests that personality and attitudes (whether political, sexual or regarding teaching) play a negligible role in determining student ratings. If there is an effect, it is mediated through either classroom behavior, which is related to teaching effectiveness, or perceived similarity, which means there is not a consistent effect for all students.
Some faculty believe that the workload and the difficulty of the courses they teach have a significant effect on the ratings they receive. The research on this particular variable, however, has produced some surprising results. Marsh (1987) found that "higher levels of Workload/Difficulty were positively correlated with student ratings" (p. 316) and therefore did not constitute a bias (see studies by Reedman and Stumpf, 1978; Frey et al., 1975; Pohlman, 1972 all cited in Marsh, 1987). In their 1997 overview, Marsh and Roche conclude that "harder, more difficult courses requiring more effort and time are rated somewhat more favorably" (p. 1194). The psychometric analysis found no significant correlations between perceived workload of the course and student ratings of instruction using the proposed University of Calgary 12-item instrument.
Researchers have also studied the effects of other potential biases such as required versus elective courses and academic discipline. The literature supports the belief that elective courses are rated more highly than required courses (Arreola, 1995; Marsh and Roche, 1997). In addition, according to Marsh and Roche, courses in the sciences appear to be rated lower than courses in the humanities, but they describe this as a "weak tendency" and suggest that there have not been enough studies done to draw any firm conclusions. In summary, Marsh and Roche (1997) agree with McKeachie (1990) who, in turn, points out that although there are a number of variables that could potentially bias student ratings of instruction, these variables have little effect. McKeachie says:
"Potentially contaminating variables such as...class size, or required versus elective classes, make a difference, but not a large enough difference to cause researchers to misclassify a good teacher as "poor." Although one should also get evidence from other sources if a teaching evaluation is to lead to an important personnel decision, student ratings are the best validated of all the practical sources of relevant data" (p. 195).
The psychometric analysis of the proposed University of Calgary 12-item instrument by COE (Kline and Rever-Moriyama, 1998) evaluated the effects on student ratings of a number of variables including: age of student, number of half-courses taken, percent of classes attended, class type, team versus non-team taught, and number of times the instructor had taught the course previously. None of these variables had a significant impact on student ratings. The study did show that there was a correlation between the global item (item 1) which reads, "The instruction in this course was: " and the number of university-level teaching awards won by the instructor. The study concludes that "ratings for award-winning instructors were higher than those for other instructors" (Kline and Rever-Moriyama, 1998, p. 16). This is an expected outcome of a valid ratings of instruction instrument. Another finding of the psychometric analysis focused on perceived faculty differences. Item 10, "Students' work was graded promptly" was the only item that varied across Faculties. Follow-up tests comparing the Faculties individually indicated that instructors from the Faculty of Management "were rated lower on grading work promptly than those from the Faculties of Humanities, Science or Social Science", and "those from the Faculty of Science were rated lower...than those from the Faculty of Humanities" (p. 17). No differences between Faculties were significant for any other items of the Instrument. Based on an analysis of data gathered at the University of Calgary, the COE report found that the Universal Ratings of Instruction Instrument is "psychometrically sound", stating that:
[t]he items have an underlying thematic structure that is reasonable, are internally consistent, are capable of discriminating performance on different aspects of instruction, and related appropriately to the external criterion of teaching awards. On the whole, the items on the Instrument were almost unaffected by student or instructor demographic data, course type or Faculty (p. 17).
To determine the effects of student ratings on the quality of teaching, Murray (1996) reviewed research evidence from three different sources: faculty surveys, field studies, and longitudinal comparisons.
Although the impact of student ratings on instructional quality is not assessed directly by faculty surveys, they do provide a useful index of instructor beliefs regarding the issue. Murray (1996) reviewed the results from eight published surveys of faculty opinion from across the United States and Canada which included either one or both of the following questions: "Do student ratings provide useful feedback for improvement of teaching?" and "Have student ratings led to improved teaching?" (Murray, 1996, p. 5). Although the findings differed somewhat between studies, generally, faculty participants agreed that student ratings do lead to improvement in teaching (p. 5). In fact, "across all surveys reviewed...and with differential weighting according to sample size, 73.4% of respondents said that student evaluation provided useful feedback, and 68.8% said that student evaluations have led to improved teaching" (p. 5).
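The phrase "differential weighting according to sample size" describes a weighted average in which each survey counts in proportion to its number of respondents. The following is only a minimal sketch of that arithmetic. The function name and the two survey figures are hypothetical illustrations, not the data from the eight surveys Murray reviewed.

```python
def weighted_percentage(surveys):
    """Pool per-survey percentages, weighting each by its sample size.

    `surveys` is a list of (n_respondents, percent_agreeing) pairs.
    """
    total_n = sum(n for n, _ in surveys)
    if total_n == 0:
        raise ValueError("at least one survey must have respondents")
    # Each survey contributes in proportion to its number of respondents.
    return sum(n * pct for n, pct in surveys) / total_n


# Hypothetical figures for illustration only (not Murray's data):
# a 100-person survey with 80% agreement and a 300-person survey with 70%.
hypothetical_surveys = [(100, 80.0), (300, 70.0)]
pooled = weighted_percentage(hypothetical_surveys)
print(pooled)  # 72.5, closer to the larger survey than the unweighted mean of 75.0
```

Under this weighting, a large survey moves the pooled figure more than a small one, which is why the pooled value need not equal the simple average of the individual survey results.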
A study conducted by McKeachie et al. (1980 cited in Murray, 1996) compared different groups of teachers who, half-way through the semester, received either a) a computer printout of student ratings; b) a printout of student ratings plus individual consultation with a faculty development "expert" who provided support and explicit suggestions for improvement or c) no student ratings feedback (Murray, p. 7). These conditions produced significant differences in their ratings at the end of the semester. The "feedback-plus-consultation" group received the highest ratings, the feedback-only group received the next highest ratings and the no-feedback control group received the lowest ratings. These findings led the investigators to conclude that "student feedback alone led to modest improvement in perceived quality of teaching, whereas student feedback supplemented by expert consultation produced much larger gains in teaching" (p. 7).
Murray (1996) also cites meta-analyses of field experiments carried out by Cohen (1980) and Menges and Brinko (1986) that reached similar conclusions. Based on these findings, Murray has concluded that field experiments suggest that student rating feedback alone "leads to a modest improvement in faculty teaching performance," and student rating feedback "supplemented either by expert consultation or by clarification of specific teaching behaviours leads to more substantial gains in quality of teaching" (p. 9).
Comparisons of mean student rating scores longitudinally over a number of years after student evaluations have been used in a particular department or faculty have also been used to assess the long-term effects of rating feedback on teaching effectiveness. Murray (1996) notes that this approach is based on the assumption that if student ratings do contribute to the improvement of teaching, then the effect should be reflected "in a gradual increase across years in the average teacher rating score of participating faculty members" (p. 11). The published research on longitudinal studies has produced mixed results. Some studies find a longitudinal improvement in mean student ratings for the department or faculty as a whole and some do not. Murray concludes that the mixed results are due to the fact that most of the studies have not fulfilled all the methodological conditions necessary to provide meaningful results (e.g. the mean ratings should be "compared across a minimum of 10 years or 10 semesters," and the same student evaluation form should be employed for the duration of the study; see Murray, 1996, p. 11 for a detailed discussion).
Murray has summarized his findings related to the effects of student evaluations on the improvement of teaching into four general conclusions:
Other studies have also provided evidence that student evaluations contribute to the improvement of teaching. Wilson (1986 cited in Weimer & Lenze, 1997) conducted a study in which award-winning teachers were asked to characterize their teaching behaviours. Student evaluations were then carried out with a group of "teacher-clients." He consulted with his clients regarding their teaching evaluations and made specific concrete suggestions for improvement, including the teaching behaviours cited by the award-winning teachers. A second evaluation was conducted after an intervening semester. No difference was seen in the ratings received by the comparison group who received only student evaluation feedback but no consultation (Weimer & Lenze). For the teacher-clients who received such input, however, there was a "statistically important change in overall teaching effectiveness ratings for 52 percent of the faculty clients" (p. 209). Furthermore, the data suggested that the "items on which the greatest number of faculty showed statistically important change were those for which the suggestions were most concrete, specific and behavioral" (p. 209). Stevens and Aleamoni (1985 cited in Weimer & Lenze, 1997) similarly reported that "provision of consultation in addition to student ratings feedback resulted in an increase in student ratings that was maintained over time" (p. 303). These researchers suggest that more longitudinal research is needed in this area, and recommend that student ratings feedback "must be integrated with a system of instructor training and available instructional support services" (p. 303).
Few research data are available on the use and impact of the publication of student rating information. Although many universities and colleges publish student ratings of teaching, few appear to have also published studies regarding the outcome of doing so. In a series of personal communications, Schulz (1998) asked teaching development personnel from several Canadian universities about their concerns with rating publication. He reports that although some universities are concerned about possible legal implications of publication (e.g. York, U of Victoria), others report few faculty concerns about publication (e.g. Western). Several have even placed course ratings on the World Wide Web (Ottawa, Alberta, Queen's and Victoria). These informal data suggest that in Canada, publication of student ratings occurs widely and faculty members are generally not concerned. It should be noted, however, that many of these universities allow professors to opt out of publication. A survey of faculty (N=547, 29% response rate) at the University of Texas at Austin examined faculty attitudes towards student ratings of teaching effectiveness (Curran, Koch, Svinicki & Lewis, 1983). Results indicated that faculty thought that the ratings were useful for teaching and course improvement, but less useful for student course selection. This is not surprising because the majority (60-75%) of these faculty did not allow their ratings to be published. Over a third of those surveyed (38%) were uncertain of the usefulness of the ratings for course selection, suggesting faculty are unaware of the extent to which ratings are used. Additionally, 25% thought the ratings were difficult for students to interpret. Overall, then, faculty in this sample were ambivalent about the usefulness of published ratings. The study did not examine the extent to which the views of this sample were representative of the larger campus community.
Although there are few data available on the topic, the concern that the published rating information will be viewed by inappropriate people may be more theoretical than practical. At least, experience from universities that currently publish ratings suggests that the outside community does not take notice of ratings (Galbraith, 1997).
Brickman (1976) suggested that publication of teaching evaluation be approached in a manner similar to any other publication. That is, he suggested that professors be allowed to choose which course ratings will be published and to add comments regarding what they had hoped to achieve in that course and what was achieved. He points out that this would still inform students as to course quality, as those professors who chose not to publish would likely be regarded as poor professors or as teaching worthless courses.
One of the concerns that has been raised regarding the publication of professor ratings is that these may create a self-fulfilling prophecy. That is, students expect a poorly or well taught course and evaluate the professor accordingly. One empirical study examined this. Perry, Niemi and Jones (1974) found that students who first read an "anti-calendar" description of a highly praised professor rated the same professor, from whom they received a single academic lecture, more highly on student interaction and overall evaluation than students who first read a description of a below average professor. There were no differences in ratings of the professor's teaching skills. This appears to be the only study of its kind, and the extent to which rating a new professor on just one lecture may generalize is not clear. The differences in evaluation were found only on aspects of teaching on which the students had no information (overall performance in comparison with other university professors, and student interaction). In sum then, this study seems to indicate that in the absence of experience, students will rate professors in accordance with published ratings, but when they have first hand information, published ratings are not important. Also of note regarding this issue is that students often enter classrooms with preconceived ideas about the course and professor that are derived from the "grapevine" (Galbraith, 1997). The data from the grapevine are likely not as representative of student opinions as are the data from university evaluations.
Theoretically, the publishing of course ratings should aid students in choosing courses and professors. Moreover, the publishing of these ratings should encourage low-scoring professors to improve their teaching. However, some authors (e.g. Goldschmid, 1978; Perry et al., 1974) have speculated that publishing ratings could have the opposite effect. Professors with poor ratings may attract less motivated and less academically oriented students who will in turn produce even lower ratings. There is no evidence to support this position. In addition, Goldschmid (1978) notes that student "anti-calendars" may also "serve to move institutions and faculty to reconsider their policy and practice with regard to teaching" (p. 232).
To what extent do students actually use ratings to select courses? In their review, Marsh & Dunkin (1997) conclude that, in general, studies have found that students do use published ratings to aid in course selection and that ratings are useful. A related concern is that in choosing courses, students may misinterpret the information contained in the ratings. This can be avoided by the inclusion of information above and beyond the mean (Galbraith, 1997). The proposed system will include the percent of students in the class that responded, the distribution displayed as an easily interpretable graph, the standard deviation, information regarding class composition and a place for the professor to provide any information she/he feels is necessary. This information should reduce potential misinterpretation.
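As a rough illustration of the kind of summary described above, the following minimal sketch computes a response rate, mean, standard deviation and distribution for one item in one class. It is not the proposed system itself. The function names, the 7-point scale and the sample responses are all assumptions made for the example, and the choice of the sample (rather than population) standard deviation is likewise an assumption.

```python
from collections import Counter
from statistics import mean, stdev


def summarize_ratings(ratings, enrolled, scale=range(1, 8)):
    """Summarize one item's ratings for one class.

    `scale` defaults to a hypothetical 7-point scale; the source does not
    specify the scale used on the actual instrument.
    """
    if enrolled <= 0 or not ratings:
        raise ValueError("need a positive enrolment and at least one rating")
    counts = Counter(ratings)
    return {
        "response_rate": 100.0 * len(ratings) / enrolled,
        "mean": mean(ratings),
        # Sample standard deviation; undefined for a single response.
        "std_dev": stdev(ratings) if len(ratings) > 1 else 0.0,
        "distribution": {point: counts.get(point, 0) for point in scale},
    }


def text_histogram(distribution):
    """Render the distribution as a simple text bar graph."""
    return "\n".join(
        f"{point}: {'#' * count} ({count})"
        for point, count in sorted(distribution.items())
    )


# Hypothetical class: 10 students enrolled, 6 responded.
summary = summarize_ratings([5, 6, 7, 7, 4, 6], enrolled=10)
print(f"Response rate: {summary['response_rate']:.0f}%")
print(f"Mean: {summary['mean']:.2f}  SD: {summary['std_dev']:.2f}")
print(text_histogram(summary["distribution"]))
```

Reporting the response rate and the full distribution alongside the mean lets a reader see, for example, whether a middling mean reflects general agreement or a split class.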
Locally, the University of Calgary Faculty of Law does publish course ratings which are available in the Law Library. There are no hard data available on the extent to which students access this information or the degree to which course selection is affected by doing so. Evidence from the President of the Society of Law Students suggests that students do not access the information because: 1) there are rarely multi-section courses and 2) students gather the information informally from other students. In a survey of University of Calgary faculty concerns about student evaluation (COE, 1997), 27 of 69 respondents mentioned the issue of publication. Fifteen of them felt that ratings should be published; twelve felt that they should not be.
McKeachie (1997), one of the leading scholars in student evaluations research, commented on a number of important factors with respect to student evaluations that have not been discussed in this review. For example, he suggests that a variety of student ratings forms are necessary in order to account for the differences between the different modes of teaching occurring today (e.g. the increasing use of technology, virtual universities, and cooperative learning). He also points out that researchers "need to study what teachers can do to help students become more sophisticated raters" (p. 1223). Most importantly, McKeachie argues for more research "on how to train members of personnel committees to be better evaluators, and research is needed on ways of communicating the results of student evaluations to improve the quality of their use" (p. 1223).
As noted numerous times throughout this review, the literature clearly demonstrates that student rating forms that are psychometrically sound are reliable, valid, relatively free from bias, and useful in improving teaching. d'Apollonia and Abrami (1997b) cite Scriven's (1988) general conclusion that "student ratings are not only a valid, but often the only valid way to get much of the information needed for most evaluations" (p. 19). Marsh and Dunkin (1997) conclude that despite "ill-founded fears" on the part of the faculty, and claims based on research "fraught with methodological weaknesses...[t]he bulk of the research, however, has supported their [student ratings of instruction] continued use as well as advocating further scrutiny" (pp. 311-312).
Although not much research has studied the impact of the publication of student rating information, the degree to which problems might occur appears to be related more to the specifics of the publication process than to publication in general. Available findings indicate that by and large published ratings are used by students in course selection, that the information is of little interest to other parties, and that the majority of faculty in institutions where publication occurs have few difficulties with the process.
Abrami, P. C., d'Apollonia, S. & Cohen, P. A. (1990). The validity of student ratings of instruction: What we know and what we do not. Journal of Educational Psychology, 82, 219-231.
Abrami, P.C. & Mizener, D.A. (1985). Student/Instructor attitude similarity, student ratings and course performance. Journal of Educational Psychology, 77(6), 693-702.
Abrami, P. C., Leventhal, L. & Perry, R. P. (1982). Educational seduction. Review of Educational Research, 52, 446-464.
Aleamoni, L. M. (1987). Typical faculty concerns about student evaluation of teaching. In L. M. Aleamoni (Ed.), Techniques for evaluating and improving instruction. New Directions for Teaching and Learning, no. 31. San Francisco: Jossey-Bass.
Aleamoni, L. M., & Hexner, P. Z. (1980). A review of the research on student evaluation and a report on the effect of different sets of instructions on student course and instructor evaluation. Instructional Science, 9, 67-84.
Aleamoni, L. M., & Yimer, M. (1973). An investigation of the relationship between colleague rating, student rating, research productivity, and academic rank in rating instructional effectiveness. Journal of Educational Psychology, 64, 274-277.
Arreola, R. A. (1995). Developing a comprehensive faculty evaluation system: A handbook for college faculty and administrators on designing and operating a comprehensive faculty evaluation system. Bolton, MA: Anker Publishing Co.
Brickman, P. (1976). Publication as a model for teacher and student evaluation. Teaching of Psychology, 3(1), 31-32.
Bausell, R. B. & Magoon, J. (1972). The validation of student ratings of instruction: An institutional research model. Newark, Delaware: College of Education, University of Delaware.
Brown, J. (1998). 10 ways to get better student ratings: 2 that may actually work. Core Issues, 8, 1-7. (York University newsletter).
Centra, J. A. (1975). Colleagues as raters of classroom instruction. Journal of Higher Education, 46, 327-337.
Centra, J. A. (1993). Reflective faculty evaluation: Enhancing teaching and determining faculty effectiveness. San Francisco: Jossey-Bass.
Creating Organizational Excellence (1997). Universal Student Ratings of Instruction. Phase I. University of Calgary. Author.
Cohen, P. A. (1981). Student ratings of instruction and student achievement: A meta-analysis of multisection validity studies. Review of Educational Research, 51, 281-309.
Costin, F., Greenough, W. T., & Menges, R. J. (1971). Student ratings of college teaching: Reliability, validity, and usefulness. Review of Educational Research, 41, 511-535.
Curran, L. T., Koch, W. K. & Svinicki, M. P. (1983). Faculty opinion of the course-instructor evaluation program at U. T. Austin. ED250347, Texas: University of Texas Press.
d'Apollonia, S. & Abrami, P. C. (1997). Navigating student ratings of instruction. American Psychologist, 52, 1198-1208.
d'Apollonia, S. & Abrami, P. C. (1997b). In Response. Change, September/October, 18-19.
Deming, W. E. (1972). Memorandum on teaching. American Statistician, 26, 47.
Doyle, K. O. & Whitely, S. E. (1974). Student ratings as criteria for effective teaching. American Educational Research Journal, 11, 259-274.
Erdle, S., Murray, H. G. & Rushton, J. P. (1985). Personality, classroom behavior and student ratings of college teaching effectiveness: A path analysis. Journal of Educational Psychology, 77(4), 394-407.
Feldman, K. A. (1986). The perceived instructional effectiveness of college teachers as related to their personality and attitudinal characteristics: A review and synthesis. Research in Higher Education, 24(2), 139-213.
Feldman, K. A. (1989). The association between student ratings of specific instructional dimensions and student achievement: Refining and extending the synthesis of data from multisection validity studies. Research in Higher Education, 30, 137-194.
Frey, P. W. (1978). A two-dimensional analysis of student ratings of instruction. Research in Higher Education, 9, 69-91.
Galbraith, P. (1997). Student evaluation of instruction: research implications and potential application. Students' Union, University of Calgary.
Goldschmid, M. L. (1978). The evaluation and improvement of teaching in higher education. Higher Education, 7, 221-245.
Greenwald, A. G. & Gillmore, G. M. (1997). Grading leniency is a removable contaminant of student ratings. American Psychologist, 52, 1209-1217.
Greenwald, A. G. (1997). Validity concerns and usefulness of student ratings of instruction. American Psychologist, 52, 1182-1186.
Grush, J. E. & Costin, F. (1975). The student as consumer of the teaching process. American Educational Research Journal, 12, 55-66.
Kline, T. & Rever-Moriyama, S. (1998). Universal Student Ratings of Instruction. Phase II, Volumes 1 and 2. Report by Creating Organizational Excellence (COE), University of Calgary.
Liddle, B. J. (1997). Coming out in class: Disclosure of sexual orientation and teaching evaluations. Teaching of Psychology, 24(1), 32-35.
Linsky, A. S. & Straus, M. A. (1975). Student evaluations, research productivity, and eminence of college faculty. Journal of Higher Education, 46, 89-102.
Marsh, H. W. (1977). The validity of students' evaluations: Classroom evaluation of instructors independently nominated as best and worst teachers by graduating seniors. American Educational Research Journal, 14, 441-447.
Marsh, H. W. (1984). Students' evaluations of university teaching: Dimensionality, reliability, validity, potential biases, and utility. Journal of Educational Psychology, 76, 707-754.
Marsh, H. W. (1987). Students' evaluations of university teaching: Research findings, methodological issues, and directions for future research. International Journal of Educational Research, 11, 253-388.
Marsh, H. W. & Overall, J. U. (1979). Midterm feedback from students: Its relationship to instructional improvement and students' cognitive and affective outcomes. Journal of Educational Psychology, 71, 856-865.
Marsh, H. W. & Roche, L. A. (1997). Making students' evaluations of teaching effectiveness effective: The critical issues of validity, bias, and utility. American Psychologist, 52, 1187-1197.
Marsh, H. W. & Dunkin, M. J. (1997). Students' evaluations of university teaching. In R. Perry and J. Smart (eds.), Effective Teaching in Higher Education: Research and Practice. New York: Agathon Press.
Marsh, H. W. & Ware, J. E. (1982). Effects of expressiveness, content coverage, and incentive on multidimensional student rating scales: New interpretations of the Dr. Fox effect. Journal of Educational Psychology, 74, 126-134.
McKeachie, W. J. (1979). Student ratings of faculty: A reprise. Academe, 65, 384-397.
McKeachie, W. J. (1990). Research on College Teaching: The Historical Background. Journal of Educational Psychology, 82, 189-200.
McKeachie, W. J. (1997). The validity of use. American Psychologist, 52, 1218-1225.
McKeachie, W. J., Lin, Y. & Mendelson, C. N. (1978). A small study assessing teacher effectiveness: Does learning last? Contemporary Educational Psychology, 3, 352-357.
Murray, H. G. (1996). Does evaluation of teaching lead to improvement of teaching? Submitted to International Journal of Academic Development.
Murray, H. G. (1983). Low inference classroom teaching behaviors and student ratings of college teaching effectiveness. Journal of Educational Psychology, 75, 138-149.
Murray, H. G. (1984). The impact of formative and summative evaluation of teaching in North American universities. Assessment and Evaluation in Higher Education, 9, 117-131.
Perry, R. P., Niemi, R. & Jones, K. (1974). Effect of prior teaching evaluations and lecture presentation on ratings of teacher performance. Journal of Educational Psychology, 66(6), 851-856.
Phillips, P. (1998). Student views of student evaluations of teaching. Core Issues, 8, 9-11. (York University newsletter).
Thomas, D., Ribich, F. & Freie, J. (1982). The relationship between psychological identification with instructors and student ratings of college courses. Instructional Science, 11(2), 139-154.
Tollefson, N., Chen, J. S. & Kleinsasser, A. (1989). The relationship of students' attitudes about effective teaching to students' ratings of effective teaching. Educational and Psychological Measurement, 49(3), 529-536.
Weimer, M. & Lenze, L. F. (1997). Instructional interventions: A review of the literature on efforts to improve instruction. In R. Perry and J. Smart (eds.), Effective Teaching in Higher Education: Research and Practice. New York: Agathon Press.
Williams, W. M. & Ceci, S. J. (1997). "How'm I doing?" Problems with student ratings of instructors and courses. Change, September/October, 13-23.
Wilson, R. (1998). New research casts doubt on value of student evaluations of professors. The Chronicle of Higher Education, January, A12-A14.
Wilson, R. C., Gaff, J. G., Dienst, E. R., Wood, L. & Bavry, J. L. (1975). College professors and their impact on students. New York: Wiley.