The Challenge

In thinking about reflective coaching feedback, we have asked these questions repeatedly:

  • Once writing instructors (especially GTAs) have been trained to use reflective coaching feedback, do they actually implement it? How often do they use it compared to copy-editing or prescriptive/directive feedback?
  • Do the types of comments instructors give change over time?
  • How long does it take for instructors to make the switch?

Instructor self-reporting is insufficient to answer these questions with any confidence. More direct assessment methods are needed.

 

Our Approach

We evaluated GTA comments on student writing by extracting the comments from graded reports and then classifying them using qualitative coding. The procedures we used to extract GTA comments and build our classification codebook, as well as the codebook itself, may be useful to others who want to conduct similar assessments.
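For readers who want to attempt a similar extraction, the sketch below shows one way to pull margin comments out of a graded .docx report in R by reading the word/comments.xml file inside the archive. This is a minimal sketch under stated assumptions (comments left with Word's built-in commenting tool, the xml2 package installed, placeholder file and column names); it is not the annotated production script linked under Available Project Resources below.

    # Minimal sketch: extract Word margin comments from graded .docx reports.
    # Assumes the xml2 package; "graded_reports/" and column names are placeholders.
    library(xml2)

    extract_docx_comments <- function(docx_path) {
      # A .docx file is a zip archive; reviewer comments live in word/comments.xml
      out_dir <- tempfile()
      found   <- unzip(docx_path, files = "word/comments.xml", exdir = out_dir)
      if (length(found) == 0) return(NULL)         # report has no comments

      doc   <- read_xml(found[1])
      nodes <- xml_find_all(doc, "//w:comment")    # one node per margin comment

      data.frame(
        file    = basename(docx_path),
        author  = xml_attr(nodes, "author"),       # grader identity (anonymize before analysis)
        comment = xml_text(nodes),                 # full comment text
        stringsAsFactors = FALSE
      )
    }

    # Apply to a folder of graded reports and stack the results
    reports  <- list.files("graded_reports", pattern = "\\.docx$", full.names = TRUE)
    comments <- do.call(rbind, lapply(reports, extract_docx_comments))

The resulting data frame feeds directly into the qualitative coding (and, later, the automated classification) described below.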

 

Lessons Learned

Manual Classification

After a single training experience, GTAs did not incorporate more reflective feedback into their grading comments. Around 50% of all comments focused on technical flaws, and roughly two-thirds of comments were either copy edits (29%) or specific instructions (38%). Only 3.5% of all comments gave reflective "holistic advice that could transfer to other situations."

In response, we revised our grading policies: GTAs must now provide each student with a global summary statement to accompany their report score. The summary statement must point out the three most important errors to address in revision and ideally include at least one reflective question. Summary comments are recorded with report grades in the campus LMS, making it easier for supervising faculty to assess GTA compliance.

Requiring at least one reflective coaching comment appears to have increased uptake of reflective coaching overall. This anecdotal observation needs to be confirmed in the near future.

 

Automated Comment Classification

Manually coding GTA comments was very informative for us, but it is unsustainable as a routine assessment strategy because:

  • Developing the codebook and then rating the initial set of 11,000 comments required over 200 hours of investigator effort.
  • For a mid-sized lab program like our own, GTAs generate 10,000-12,000 comments each semester, which would require 50-100 dedicated coding hours every semester.
  • Coders must be trained to achieve sufficient inter-rater reliability (a brief agreement-check sketch follows this list), adding to the time required.
  • There is significant risk of "coding drift" (changes in how code features are interpreted by a single rater over time).
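As an aside on the inter-rater reliability check mentioned in the list above, the snippet below is a minimal illustration of how agreement between two coders can be quantified in R with Cohen's kappa. The irr package and the toy ratings are assumptions made for illustration, not our actual codebook data.

    # Minimal sketch: quantify agreement between two coders on the same comments.
    # Assumes the irr package; the ratings below are made-up examples.
    library(irr)

    ratings <- data.frame(
      coder_1 = c("copy edit", "directive", "reflective", "copy edit", "directive"),
      coder_2 = c("copy edit", "directive", "directive",  "copy edit", "directive")
    )

    kappa2(ratings)   # Cohen's kappa for two raters (unweighted)
    agree(ratings)    # simple percent agreement, for comparison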

To address these limitations, we conducted a pilot study to determine whether supervised text classification using naive Bayes could replace manual qualitative coding. Briefly, we:

  1. Compiled an anonymized testing dataset of hand-coded TA comments from reports.
  2. Sorted comments by data structure, n-gram frequencies, etc., and identified potential elements for feature engineering.
  3. Wrote a supervised text classification workflow using naive Bayes to assign TA comments to pre-defined categories.
  4. Tested permutations of analysis parameters to optimize the classifier and establish baseline accuracy.
  5. Applied the optimized naive Bayes classifier to the original full comment dataset, identified which comments were being classified incorrectly, and looked for potential patterns in the errors.
  6. Based on features identified in Step 3, wrote a small set of rule- or regex pattern-based searches that could identify comments that are more likely to be classified incorrectly.
  7. Combined the rule/pattern-based pre-screening process with the optimized naive Bayes classifier to create an Ensemble classifier (a minimal R sketch of Steps 3, 6, and 7 follows this list).
  8. Re-validated the Ensemble classifier against a subset of ~2,000 independently hand-coded TA comments extracted from student lab reports from a different semester than the initial test dataset.
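The sketch below illustrates Steps 3, 6, and 7 in R. The quanteda and quanteda.textmodels packages, the column names (comment, category), and the example regex rules are assumptions made for illustration; the scripts linked under Available Project Resources implement the actual pilot workflow.

    # Minimal sketch of Steps 3, 6, and 7. Assumes quanteda and quanteda.textmodels;
    # 'comments' is a data frame of hand-coded comments with columns 'comment' (text)
    # and 'category' (codebook label). Names and regex rules are placeholders.
    library(quanteda)
    library(quanteda.textmodels)

    set.seed(42)

    # Step 3: supervised naive Bayes text classification
    corp  <- corpus(comments, text_field = "comment")
    toks  <- tokens_ngrams(tokens_tolower(tokens(corp, remove_punct = TRUE)), n = 1:2)
    dfmat <- dfm(toks)                                   # unigram + bigram features (Step 2)

    train_ids <- sample(ndoc(dfmat), size = round(0.8 * ndoc(dfmat)))
    dfm_train <- dfmat[train_ids, ]
    dfm_test  <- dfm_match(dfmat[-train_ids, ], features = featnames(dfm_train))

    nb      <- textmodel_nb(dfm_train, y = docvars(dfm_train, "category"))
    nb_pred <- as.character(predict(nb, newdata = dfm_test))

    # Step 6: rule/regex pre-screen for comment types naive Bayes tends to miss
    # (the patterns below are invented examples, not our actual rules)
    rules <- c(
      "copy edit"  = "\\b(spelling|typo|comma|capitaliz)\\w*",
      "reflective" = "\\b(why|how might|consider|what would)\\b"
    )
    apply_rules <- function(txt) {
      for (label in names(rules)) {
        if (grepl(rules[[label]], txt, ignore.case = TRUE)) return(label)
      }
      NA_character_                                      # no rule fired
    }

    # Step 7: Ensemble = rule-based pre-screen first, naive Bayes otherwise
    test_text <- comments$comment[-train_ids]
    rule_pred <- vapply(test_text, apply_rules, character(1))
    ensemble  <- ifelse(is.na(rule_pred), nb_pred, rule_pred)

    # Step 8 (held-out check): overall agreement with hand coding
    truth <- comments$category[-train_ids]
    mean(ensemble == truth)

In the actual pilot, Step 4 involved testing many permutations of tokenization, weighting, and smoothing parameters; the sketch shows only a single configuration.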
     

Key Findings

Overall, the Ensemble classifier achieved results similar to hand coding for determining the subject of a GTA's comment, but it failed to correctly classify comment structure (i.e., copy edit vs. specific recommendation).
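For completeness, the snippet below shows one way to make that comparison concrete: tabulate automated vs. hand-coded labels and compute per-category agreement. The object names (ensemble, truth) carry over from the sketch above and are placeholders.

    # Compare automated labels with hand codes (placeholder objects from the sketch above)
    ensemble_chr <- as.character(ensemble)
    truth_chr    <- as.character(truth)
    labs <- union(ensemble_chr, truth_chr)
    conf <- table(predicted  = factor(ensemble_chr, levels = labs),
                  hand_coded = factor(truth_chr,    levels = labs))
    conf                             # confusion matrix
    diag(conf) / colSums(conf)       # agreement within each hand-coded category
    sum(diag(conf)) / sum(conf)      # overall agreement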

We have not abandoned the idea of automated classification. There are many other potential classification strategies that have yet to be evaluated.

 

Available Project Resources

  • Graders' Guide to Reflective Feedback: DOCX file
  • Annotated R script to extract comments from .DOCX files: Link to web site
  • Codebook development strategy and final codebook: R/MD file, DOCX file, Link to GitHub repository
  • Automated comment classification pilot project: Link to web site, Link to GitHub repository

 

Looking Ahead

  • We continue to evaluate other text classification methods with the goal of improving accuracy of automated analyses of GTA comments.
  • As noted above, requiring at least one reflective coaching comment appears to have increased uptake of reflective coaching overall. Once we identify a more robust classifier, we aim to reassess the extent to which GTAs incorporate reflective coaching comments into student feedback before vs. after implementing mandatory global summary statements as part of grading.

Check the list of To Do items in the Coaching-Oriented Feedback sub-project for more information about specific work in progress. Let us know if you want to contribute to one or more associated projects, or have other resources you would like to contribute.

 


Where to Learn More

  1. Balfour, S. P. 2013. Assessing Writing in MOOCs: Automated Essay Scoring and Calibrated Peer Review. Research & Practice in Assessment. 8:40-48. https://www.rpajournal.com/dev/wp-content/uploads/2013/05/SF4.pdf

  2. Ha, M., and R. H. Nehm. 2016. Predicting the Accuracy of Computer Scoring of Text: Probabilistic, Multi-Model, and Semantic Similarity Approaches. Presentation at NARST Meeting, Baltimore, MD.

  3. Kaplan, J. J., K. C. Haudek, M. Ha, N. Rogness, and D. G. Fisher. 2014. Using Lexical Analysis Software to Assess Student Writing in Statistics. Technology Innovations in Statistics Education 8. doi:10.5070/T581020235.

  4. Ruegg, R. 2015. Differences in the Uptake of Peer and Teacher Feedback. RELC Journal 46 (2): 131–45.

  5. Shermis, M. D., J. Burstein, D. Higgins, and K. Zechner. 2010. Automated Essay Scoring: Writing Assessment and Instruction. 4: 20–26.

  6. Urban-Lurain, M., M. M. Cooper, K. C. Haudek, J. J. Kaplan, J. K. Knight, and P. P. Lemons. 2015. Expanding a National Network for Automated Analysis of Constructed Response Assessments to Reveal Student Thinking in STEM. Computers in Education Journal 6: 65–81.

  7. Weston, M., K. C. Haudek, L. Prevost, M. Urban-Lurain, and J. Merrill. 2015. Examining the Impact of Question Surface Features on Students’ Answers to Constructed-Response Questions on Photosynthesis. CBE Life Sciences Education 14. doi:10.1187/cbe.14-07-0110.

  8. Zhang, P., X. Huang, Y. Wang, C. Jiang, S. He, and H. Wang. 2021. Semantic Similarity Computing Model Based on Multi Model Fine-Grained Nonlinear Fusion. IEEE Access. doi:10.1109/ACCESS.2021.3049378.

 

 
