STUDENTS RATHER THAN EXPERTS: A NEW AI FOR EDUCATION PIPELINE TO MODEL MORE HUMAN-LIKE AND PERSONALISED EARLY ADOLESCENCES
authors: Yiping Ma, Shiyu Hu, Xuchen Li
paper: https://arxiv.org/pdf/2410.15701
video: https://www.youtube.com/watch?v=fr_zUWaKDF8
code: https://marsgemini.github.io/SOE-LVSA/
- Abstract
- 1 INTRODUCTION
- 2 RELATED WORK
- 3 SCENE: BASIC CHINESE UNDERSTANDING ABILITY IN EDUCATION
- 4 OBJECT: VIRTUAL STUDENT CONSTRUCTION. 4.1 THEORETICAL FRAMEWORK
- 5 EVALUATION: ANALYSIS OF OBJECT CAPABILITIES IN SCENES
- 6 CONCLUSION
- A COMPREHENSIVE RELATED WORK
- B DETAILED INFORMATION FOR THEORY AND TECHNOLOGY USED IN THIS WORK
- C DETAILED INFORMATION FOR DATASET CONSTRUCTION PROCESS
- D DETAILED EXPERIMENT SETTINGS AND RESULTS
AI in Education (AI4Education)
- Large language models (LLMs) used for educational applications
- Focus on simulating teachers to enhance student learning outcomes
- Neglected potential of modeling virtual students
- Challenges: replicating learning difficulties, emotional responses, linguistic uncertainties
Proposed Framework: SOE (Scene - Object - Evaluation)
- Systematic construction of LLM-based Virtual Student Agents (LVSA)
- Curated dataset of personalized teacher-student interactions
- Fine-tuning LLMs using LoRA
Study Objectives:
- Generate LVSA framework
- Integrate human evaluation metrics into GPT-4 assessments
- Validate LLMs for generating human-like, personalized virtual student agents
Methodology:
- Curating dataset of teacher-student interactions
- Fine-tuning LLMs using LoRA
- Multi-dimensional evaluation experiments
- Develop theoretical framework for LVSA generation
- Human evaluation of LVSA authenticity
- Validate LLMs in educational contexts
Significance:
- Foundation for future applications in pre-service teacher training and multi-agent simulation environments.
Introductions:
- Recent advancements in LLMs have led to breakthroughs in NLP and knowledge generation, enabling AI-driven educational applications (Ge et al., 2024; Liu et al., 2024b; Hong et al., 2024; Latif et al., 2023; Chen et al., 2024a; Jury et al., 2024).
- LLMs have the potential to be virtual tutors in generating high-quality content and personalized recommendations (Lee et al.; Wang et al., 2024).
Challenges:
- Traditional pre-service teacher training faces constraints like limited environments and insufficient practical experience (Loewenberg Ball & Forzani, 2009; Zeichner, 2012; Ward et al., 2018; Sparks, 2011).
- Virtual students offer cost-effective, flexible practice opportunities but often have predictable behaviors that fail to capture the complexity of human emotional responses and cognitive variability (Badiee & Kaufman, 2015).
- Existing models lack realistic virtual student modeling, raising a critical question: "Can LLMs create human-like, personalized virtual student agents for teacher training?"
Framework:
- The proposed "Scene - Object - Evaluation" (SOE) framework addresses the challenges of modeling and evaluating LLM-based virtual student agents (LVSA).
Contributions:
- Theoretical framework for LVSA: A comprehensive framework for constructing virtual student agents with scientific rigor and feasibility.
- Subjective evaluation metrics integration: Incorporating human subjective metrics into GPT-4’s evaluation pipeline, aligning with human assessments of virtual student authenticity.
- LVSA validation: Extensive multi-dimensional experiments confirming the feasibility of creating human-like, personalized LVSA for educational AI.
Conclusion: The SOE framework provides a robust foundation for applications in pre-service teacher training, multi-agent simulations, and educational AI systems.
Virtual Student Development and Related Technologies
Three Approaches to Virtual Students:
- Role-playing: Realistic but lacks scalability and flexibility due to resource constraints (Kersh, 1963; Colwell, 2013; Frederick et al., 2010; Shapira-Lishchinsky, 2015; Dalgarno et al., 2016).
- Programmatically pre-set models: Provide predictable responses but lack adaptability to real-world classroom dynamics (Christensen et al., 2011; Delamarre et al., 2021; Shernoff et al., 2018; Kelleci & Aksoy, 2021).
- LLM-based virtual students: Offer greater flexibility in natural language interactions (Achiam et al., 2023; Li et al., 2024c; Zhang et al., 2024; Lee et al.; Markel et al., 2023).
LLMs' Capabilities for AI4Education:
- Enhanced natural language processing abilities: Understanding, generation, reasoning, transfer learning, and multimodal processing (Brown, 2020; Achiam et al., 2023; Guo et al., 2023; Li et al., 2024b; Huang et al., 2024; Ahuja et al., 2023; Kafle & Kanan, 2017).
- Models like GPT-4, LLaMA, and PaLM demonstrate remarkable performance (Touvron et al., 2023; Chowdhery et al., 2023; Wang et al., 2019; Clark et al., 2018).
- Struggle to replicate nuanced cognitive and emotional behaviors of real students due to focus on knowledge delivery (Achiam et al., 2023; Li et al., 2024c; Zhang et al., 2024; Lee et al.; Markel et al., 2023).
Representative AI4Education Applications:
- Duolingo Max, Khan Academy’s Khanmigo, and TAL’s MathGPT: LLM-based products to enhance student learning (Duolingo, 2023; Khan, 2023; TAL, 2023a; squirrelai, 2023; Chen et al., 2021).
- C-Eval, SciQAG, and Mukea: Benchmark datasets and evaluation tools for LLMs’ effectiveness in education (Huang et al., 2024; Wan et al., 2024; Ding et al., 2022).
Gap in Teacher Development Support:
- Most research prioritizes student learning over enhancing teachers' pedagogical skills (Lee et al.; Wang et al., 2024).
- LLM applications focus on automating educational tasks, leaving a gap in teacher development support (García-Méndez et al., 2024; Yue et al., 2024; McNichols et al., 2023; Sahai et al., 2023; Han et al., 2023).
Study's Novel Approach:
- Leverages virtual students to enhance teachers’ skills through realistic, interactive environments (for more details, see Appendix A.3).
Study Overview:
- Explores LLMs' capabilities in simulating realistic student behavior for early adolescents (ages 10-15)
- Critical developmental phase: transition from concrete to abstract thinking, with distinct cognitive and linguistic challenges
- Focus on junior high school Chinese education due to emphasis on expression, emotional experience, and value formation
Research Questions:
- Do foundation models possess basic comprehension abilities for middle school language subjects?
- Can LLMs accurately replicate the learning behaviors of early adolescents in real classroom settings?
Assessment:
- Evaluate foundation models' Chinese comprehension abilities to determine their potential.
Basic Chinese Understanding Ability Evaluation Dataset
Data Source: National Smart Education Platform (developed by the Chinese Ministry of Education)
Focus Skills:
- Text comprehension
- Memorization
Purpose: Assessing junior high school Chinese language proficiency
Construction Process:
- Data Preparation: Organizing Chinese exercises into units and converting them to PDF format for model input
- Prompt Design: Using structured prompts to guide model tasks
- Expert Revision: Junior high school teachers reviewed and refined model outputs for accuracy
Collected Items: 613 comprehension items, 438 memorization tasks
Detailed Information: Provided in Appendix C.1
Section 3: Foundation Models and Performance Evaluation
Foundation Models:
- InternVL: integrates visual and linguistic information for cross-modal reasoning
- LLaVa: balances efficiency with strong language understanding for medium-scale tasks
- MiniCPM: optimized for Chinese tasks, excels in comprehension and recitation
- Qwen: performs well in visual-linguistic and text-based contexts, offers flexibility for multimodal applications
Evaluation of Foundation Models:
- Performance on junior high school Chinese language tasks (text comprehension and memorization)
- InternVL:
- Comprehension accuracy: 0.736
- Memorization accuracy: 0.758
- LLaVa:
- Comprehension accuracy: 0.491
- Memorization accuracy: 0.397
- MiniCPM:
- Comprehension accuracy: 0.664
- Memorization accuracy: 0.584
- Qwen:
- Comprehension accuracy: 0.584
- Memorization accuracy: 0.614
Overall Performance and Comparison:
- InternVL achieved the highest average accuracy (0.747)
- MiniCPM followed with an average accuracy of 0.700
- Qwen and LLaVa had lower average accuracies: Qwen (0.599), LLaVa (0.444)
- The lower performance of Qwen and LLaVa may be due to their design focus on multimodal tasks, limiting their effectiveness in Chinese language processing
Conclusion:
- InternVL and MiniCPM are more suitable for junior high school Chinese tasks
- Qwen and LLaVa may have broader applications in multimodal educational contexts.
Theoretical Framework for Virtual Student Construction
Conceptual Theory:
- Early adolescence (ages 10-15): significant physiological, cognitive, social-emotional, and moral-spiritual changes
- Rapid prefrontal cortex development leads to intense emotions, inconsistent behaviors
- Cognitive transition from concrete to abstract reasoning, influenced by emotions
- Heightened self-awareness and sensitivity to peer feedback impact engagement
- Moral development involves internalizing societal norms, leading to simplified moral judgments
Operational Theory:
- Generate dialogues that are operational and predictable for realistic simulations of student performance
- Practical dimensions: question-answer types, personality traits, learning stages, response styles, generation sources
Question-Answer Types:
- Simulate cognitive diversity in virtual students
Personality Traits:
- Capture individual differences using the Big Five Personality Traits
Learning Stages:
- Reflect different cognitive abilities
Response Styles:
- Represent individual language differences
Generation Source:
- Combines techniques like Few-shot In-Context Learning (Few-shot ICL) and Chain of Thought (CoT) with fine-tuning on classroom data to maintain response authenticity.
Figure 3:
- Theoretical framework for virtual student construction, extending from conceptual theory to operational theory.
Fine-Tuning Dataset
Data Source:
- Aligned with Basic Chinese Understanding Ability Evaluation Dataset described in Section 3.1
- Incorporates data from:
- Real classroom video recordings
- Textbook content
- Teacher-prepared lesson plans
Dataset Construction:
- Stages:
- Data preparation
- Prompt design
- Expert revision
- Large-scale dialogue generation based on the Big Five personality traits
- Creation of fine-tuning datasets
- Incorporating Few-shot ICL enhanced model's ability to simulate distinct student personalities
Data Preparation:
- Selecting representative Chinese texts and transcribing classroom videos for authentic teacher-student dialogues
Prompt Design:
- Aligned with real classroom interactions to align with Big Five traits using GPT-4
- Enhanced virtual student personalization
Expert Revision:
- Ensured alignment before large-scale dialogue generation
- Resulted in dataset reflecting diverse personality-based interactions
Fine-Tuning Datasets:
- 5 distinct instruction fine-tuning datasets created, each corresponding to a specific personality trait
Dataset Analysis:
- Word cloud visualizations used to analyze students' expression styles:
- HE: Dominant words indicate frequent use of self-referential and positive language, strong inclination toward interaction and narration (Figure 5a)
- HN: Hesancy, anxiety, uncertainty (Figure 5b)
- LO: Conservative and restrained language, reliance on established knowledge structures (Figure 5c)
- HA: Cooperation, care, and empathy (Figure 5d)
- LC: Lack of precision, disorganized expression, vague language (Figure 5e)
- Findings align with existing research showing language style reflects cognitive abilities and personality traits
Additional Examples: (Refer to Appendix C.2.3 for more details)
Fine-tuning Foundation Models (4.3)
- Utilized a high-performance setup with eight A6000 GPUs to fine-tune the model for capturing linguistic traits of students with various personalities.
- SWIFT infrastructure, along with LoRA, was employed for efficient fine-tuning, tailoring hyperparameters to each LVSA type.
- This approach personalizes the model to reflect specific personality traits effectively.
- Detailed examples are available in Appendix D.2.
- Pre- and post-fine-tuning results showed significant improvements in linguistic capabilities, with fine-tuned models producing more aligned responses to targeted language styles and traits, demonstrating enhanced structure and coherence.
- Further evaluation of these improvements is discussed later.
Evaluation of LVSA
- Construct subjective evaluation dataset (Figure 1(e))
- Inference dataset creation
- Fine-tuned inference
- Direct inference
- Evaluation data reconstruction
- Evaluation process (Figure 1(f))
- Human evaluation (Section 5.2)
- Human-GPT-4 comparison evaluation (Section 5.3)
- Large-scale GPT-4 evaluation (Section 5.4)
Subjective Evaluation Dataset Construction
- Four-step process: Inference dataset creation, fine-tuned inference, direct inference, evaluation data reconstruction
- Inference dataset construction:
- GPT-4 generates teacher questions based on student personality traits using prompt engineering techniques from the fine-tuning process
- Includes two fields: system (teaching scenario and student traits), query (teacher questions)
- Fine-tuned inference:
- Student responses generated for five distinct personality types: HE, LO, HN, HA, LC
- Personalized outputs from fine-tuned models
- Direct inference:
- Assessing baseline capabilities of non-fine-tuned models
- Highlights improvements in personalization and realism after fine-tuning
- Subjective evaluation dataset reconstruction:
- Total of 12,312 responses generated across four foundation models and five personality types
- Covering different learning stages and question types before and after fine-tuning
- Human evaluation:
- 80 samples selected: 40 from fine-tuned models, 40 from direct inference, 35 real student responses as control group (total of 115 samples)
- Five used for fatigue test items
- 80 samples selected: 40 from fine-tuned models, 40 from direct inference, 35 real student responses as control group (total of 115 samples)
- GPT-4 evaluation:
- Remaining 12,232 responses used for GPT-4's large-scale evaluation
- 6,116 pre-fine-tuned and 6,116 post-fine-tuned responses
- Remaining 12,232 responses used for GPT-4's large-scale evaluation
Human Turing Test
Experiment Overview:
- Aimed to determine if LVSA could emulate real students' language expressions and cognitive abilities
- Human evaluators assessed whether generated dialogues resembled those of real students
Participants and Procedure:
- Participants, acting as "judges", used teaching experience to evaluate 120 teacher-student dialogues
- Participants verbalized thought processes during evaluation, providing insights into linguistic features influencing judgments
- Subsequent semi-structured interviews explored evaluation criteria, developing a scientific framework for optimizing future virtual students
Evaluation Results:
- Fleiss's Kappa: 0.6917, indicating substantial inter-rater agreement and strong consensus among participants
- Average recognition probability: LVSA exceeded 0.9 across multiple personality traits, demonstrating close resemblance to real students
- In some cases, virtual students surpassed real students in recognition rates, highlighting LVSA's potential to effectively emulate human-like language in educational contexts
GPT-4 Comparison Validation
Human-GPT4 Evaluation Differences:
- GPT-4 prompt engineering: Systematically extracted human evaluator interview criteria for Large-Scale Virtual Student Assessments (LVSA) and integrated into GPT-4 as prompts
- Two coders used ATLAS.ti software for a two-level coding process, resulting in 4 primary and 15 secondary codes covering:
- Emotional integration
- Cognitive level
- Psychological state
- Verbal expression
- High inter-rater reliability (0.876) confirmed the validity of these dimensions
- Coded dimensions were used to design prompts for GPT-4 using a chain-of-thought approach, enabling step-by-step evaluations closely aligned with human judgments
- Structured prompt design improved GPT-4's accuracy and ensured consistency with human evaluators
GPT-4 Evaluation Ability:
- GPT-4 achieved an average evaluation score of 0.978, with perfect alignment (score of 1) for certain personality traits
- Overall Fleiss’s Kappa of 0.6806 indicates substantial agreement with human evaluators, supporting GPT-4's reliability in educational evaluation
Comparison of Evaluation Capabilities:
- Table comparing GPT-4 and 10 human evaluators' assessment of LVSA agents with different personality traits
- GPT-4 demonstrated effective simulation of nuanced performance, with the highest accuracy for certain traits
Conclusion:
- The findings establish a foundation for GPT-4's future application in educational research and practice.
Evaluation of LVSA Performance: GPT-4 Analysis
Analysis of Different Personality Types:
- Pre-fine-tuning scores: HA (29.29%), HN (44.98%), HE (26.86%), LC (15.22%), and non-LC types (36.76%)
- Post-fine-tuning scores: all models showed substantial improvements, with an average score of 72.51%
- Statistically significant improvements for all models as indicated by paired t-tests
- Virtual students with HA demonstrated the most notable improvements, p-values below 0.001
- All personality types except LC showed significant gains post-fine-tuning
Evaluation of Different Learning Stages:
- Pre-fine-tuning scores: InternVL (54.53%), LLaVa (15.22%), MiniCPM (15.38%), Qwen (68.15%), and average performance of 54.51%
- Post-fine-tuning scores: significant improvements for all models with an average improvement of 37.28%
- Statistically significant differences as indicated by paired t-tests
- Virtual students effectively adapt to different learning stages, offering comprehensive support for pre-service teacher training
Evaluation of Different Question Types:
- Pre-fine-tuning scores: closed questions (54.51%) and open questions (38.00%)
- Post-fine-tuning scores: significant improvements for all question types, average improvement of 39.76%
- Paired t-tests showed statistically significant differences, p-values below 0.05
- Greater impact on open questions due to their higher complexity and lower initial performance.
Limitations of LLMs (Limited Learning Machines) in Role-Playing Contexts
Poor Fine-Tuning Performance of LC (Low Conscientiousness) Virtual Students:
- Sparse distribution of relevant data for low-conscientiousness behaviors in original training set
- Difficulty learning accurate traits due to underrepresentation
- Higher likelihood of "hallucination" - generated responses lacking semantic coherence
- Complication in modeling negative traits associated with antisocial performance
Inconsistent Fine-Tuning Effects Across Question Types:
- Improvements seen for closed and open-ended questions, but no statistically significant improvements across both types
- Structured nature of closed-ended questions makes them easier to manage
- Deeper reasoning and creative thinking required for open-ended questions, resulting in greater variability in performance
Suboptimal Fine-Tuning Performance of LLaVa:
- Overall performance weaker than others despite improvements in personalization, question types, and learning stages
- Differences between pre-training and fine-tuning data domains, particularly cross-language issues
- Adaptability and generalization to Chinese contexts constrained due to reliance on English pre-training data.
Conclusion:
- Introduced "Scenario-Object-Evaluation" (SOE) pipeline for creating realistic virtual student agents using LLMs
- Designed for pre-service teacher training, addressing limitations of current simulation platforms
- Multi-dimensional evaluation assessed personality traits, question types, and learning stages
- Comparative experiments with human evaluators showed similarity between LLM dialogues and real student behavior
- Future work: integrate multimodal tasks (images, videos), improve error generation mechanism, enhance low-conscientiousness simulations.
Virtual Student Development and Related Technologies
Role of Teachers:
- Central resource for educational development
- Pre-service teachers depend heavily on practical training for professional growth
Virtual Students as a Tool for Pre-service Teacher Training:
- Provide cost-effective, interactive, and flexible platform
- Simulate classroom interactions
Types of Virtual Students:
- Role-playing virtual students:
- Imitated by teachers or adults
- Offer real-time interaction but rely on participants' ability to accurately simulate student responses
- Resource-intensive, lacks scalability and automation
- Pre-defined virtual students:
- Use pre-defined algorithms and scripted rules to simulate student interactions
- Repeatable and operationally controlled, useful for consistent teaching tasks
- Rigidity limits their ability to adapt to real-world classroom dynamics
- Virtual students powered by Large Language Models (LLMs):
- Leverage advanced natural language processing capabilities to generate dynamic and contextually relevant student responses
- Introduce greater flexibility and linguistic variability but face challenges in replicating full range of student behaviors
- Demonstrated strong capabilities in natural language generation and knowledge understanding, but not yet fully mimicking language expression levels and cognitive challenges faced by real students
Goal of the Study:
- Explore whether LLMs can be leveraged to create human-like and personalized virtual student agents that better simulate student cognition and emotional responses
- Provide more authentic, dynamic, and adaptive learning interactions for pre-service teacher training.
LLMs' Capabilities for AI Education
Language Understanding:
- LLMs demonstrate exceptional performance in language processing tasks
- Ability to comprehend complex linguistic structures
- High scores on benchmarks like GlUE, SuperGLUE, SQuAD
- Crucial for building virtual students that respond appropriately to teachers' questions
Language Generation:
- Models like GPT-4 and T5 demonstrate strong language generation capabilities
- Generate natural language paragraphs, dialogues, and long-form text
- Maintain contextual consistency throughout the generation process
Language Reasoning:
- LLMs required to perform complex logical inference and commonsense knowledge reasoning
- Current top-tier models like GPT-4 and PaLM excel in handling more complex reasoning tasks
Knowledge Transfer:
- Language models can apply learned knowledge to other domains or language tasks
- Models demonstrate strong cross-linguistic capabilities, handling comprehension and reasoning across various languages
Multimodal Processing:
- Models like Flamingo and InternVL combine visual and textual processing capabilities
- Offer greater potential for future development of virtual students in multimodal learning environments.
Educational Technology Applications of Large Language Models (LLMs)
Duolingo Max:
- Leverages LLMs to enhance language learning through generative dialogue and intelligent feedback
- Helps students improve language skills via interactive exercises
Khan Academy's Khanmigo:
- Powered by LLMs
- Offers intelligent tutoring and personalized learning support, spanning primary to higher education
Google Socratic Platform:
- Assists students in solving problems across multiple disciplines
Chinese Edtech Companies:
- TAL Education's MathGPT: Provides step-by-step guidance in mastering core mathematical problem-solving techniques
- Youdao's "ZiYue": Prioritizes a "scenario-first" approach
- iFLYTEK's "Spark Desk": Conducts human-like interactive learning in various fields
- Squirrel AI: Develops a personalized learning system that tailors individual learning paths
Universities:
- MIT: Achieves 81% correctness on undergraduate-level math problems using LLMs and pre-trained models
- East China Normal University's EduChat: Offers functionalities such as open Q&A, essay grading, heuristic teaching, and emotional support
Benchmark Datasets:
- C-Eval: Evaluates the performance of LLMs across various subjects from primary to higher education
- SciQAG: Measures the model's ability to answer scientific questions
- Mukea: Benchmarks multimodal knowledge extraction and integration
Conferences and Research:
- NeurIPS and AAAI have seen a growing number of research papers on LLMs in education
- Topics include supporting individualized learning paths, providing personalized feedback, and education dataset construction
Future Work:
- Research is emerging on using LLMs to support teacher skills, particularly in areas like problem-solving, knowledge explanation, and automated grading
- This study aims to introduce a novel research direction: using virtual students to enhance teachers' pedagogical skills
LLMs for Junior High School Chinese Language Tasks
Selection of Representative Foundational Models:
- InternVL: Multimodal large language model integrating visual and linguistic processing capabilities, excels at cross-modal reasoning
- LLaVa: Lightweight yet powerful LLM based on Mistral architecture, specializes in text generation and language understanding
- MiniCPM: Optimized for Chinese language tasks, deep capabilities in understanding and generating Chinese
- Qwen: Supports multimodal interactions, excels in visual-linguistic exchanges and pure text-based tasks
Performance Evaluation:
- Evaluated for comprehension and recitation abilities, essential for constructing virtual student agents
- Assesses ability to process and understand Chinese text, specifically focusing on comprehension and recitation
- Provides foundation for future expansion into multimodal teaching environments involving both textual and visual processing
Foundation LLMs Repository URLs:
- InternVL2 (8B): Shanghai AI Laboratory
- LLaVa (7b-hf): Microsoft
- MiniCPM (8B): MiniCPM-V
- Qwen (vl-chat): Alibaba Cloud
Educational Theory: Characteristics of Early Adolescent Students
Physiological Development:
- Rapid development of prefrontal cortex (complex cognitive functions) vs. faster development of limbic system (emotional regulation)
- Balance leads to intense emotional responses, limited cognitive regulation, and variability in participation
Cognitive Development:
- Transition from concrete operational thinking to abstract logical reasoning
- Gradual ability to handle abstract concepts and logical analysis
- Easily influenced by emotions and external factors
- Repetitive and uncertain language expression, especially with abstract issues
Social-Emotional Development:
- Increased self-awareness and concern for social roles
- Frequent emotional fluctuations
- Heightened sensitivity affects classroom behavior (teacher feedback, peer reactions)
- Simulation of emotional dynamics crucial for accurate reflection of student responses
Moral and Spiritual Development:
- Beginning to develop own values and face complex moral issues
- Transition from external rule adherence to internalizing societal norms
- Moral judgments often emotionally driven and oversimplified
- Interpret moral situations through personal experiences
- Emerging critical thinking shows students analyzing multiple perspectives, but skills not yet fully mature.
Operational Theory: Classification of Teacher-Student Dialogue Datasets
Introduction:
- Importance of operational theory in generating realistic student behavior and language capabilities
- Foundation for generating diverse student behaviors and consistent responses across classroom scenarios
Dimensions of Classification:
- Question-Answer Type:
- Fundamental for generating virtual students
- Different types require various cognitive processes, influencing student language output and development
- Open-ended questions encourage creative thinking; closed-ended questions focus on recalling facts
- Careful design aids in simulating realistic classroom interactions
- Personality Traits:
- Critical dimension for personalized student modeling
- Based on the Big Five Personality Traits model
- Impact on student behavior, language expression, and creation of distinct personalities
- Enhances personalized expression and exploration of language patterns
- Low Openness (LO):
- Conservative and pragmatic
- Low receptivity to new content
- Lack of initiative in exploration
- Weaker ability to handle complex issues
- Characterized by simple, direct responses and difficulty expanding discussions
- Low Conscientiousness (LC):
- Careless behavior and lack of organization
- Inconsistent responses, including self-correction
- Lack of systematic thinking leading to disorganized responses
- Characterized by simplicity, occasional self-correction, and unreliability in language style
- High Extraversion (HE):
- Active participation and strong social skills
- Desire for engagement and expressing themselves confidently
- Fluent and confident language style with minimal hesitation or pauses
- High Agreeableness (HA):
- Cooperative and empathetic behavior
- Thoughtful, patient, and tolerant of others
- Warm and friendly language expression
- High Neuroticism (HN):
- Anxiety, nervousness, and emotional fluctuations
- Hesitant, repetitive backtracking in responses
- Disjointed speech patterns due to emotional instability
- Learning Stages:
- Reflects cognitive abilities and language expression development at different stages
- Accurate modeling of learning stages creates virtual students suitable for various instructional contexts
- Answering Style:
- Individual differences in students’ language expression
- High openness leads to more creative, exploratory language; high conscientiousness results in detailed and accurate responses
- Generation Source:
- Traditional prompt engineering, few-shot ICL, CoT, and LoRA methods used for generating realistic student dialogues
- Combination of these methods ensures that generated dialogues closely resemble real-life classroom interactions
Target Group for Virtual Student Modeling: Early Adolescents (Middle School Students)
Characteristics of Early Adolescents:
- Transitioning from concrete to abstract thinking
- Cognitive development not fully matured
- Weaker logical reasoning skills
- Rely more on intuitive perception and concrete experiences
- Unclear language expression and inconsistent thought processes
- Limited language proficiency
- Heightened uncertainty and repetition in verbal responses
Value of Studying Early Adolescents:
- Represents diverse classroom behaviors
- Challenges large language models on complex tasks
- Offers insights into language expression, emotion, and attitudes
Cognitive Processes Involved in Classroom Interactions:
- Focusing on the question
- Adequate attention and selective information processing
- Understanding the question's meaning
- Relating it to existing knowledge
- Providing a response
- Knowledge application
- Clear articulation of thoughts
Importance of Language-Based Subjects:
- Emphasizes language expression, emotional experience, and value cultivation
- Aligns with core capabilities of large language models (LLMs)
- Offers insights into linguistic styles, emotional expressions, and responses to complex questions.
Chinese Understanding Ability Evaluation Dataset: Junior High School Chinese Language
Data Source:
- National Smart Education Platform: authorized and standardized educational materials from Ministry of Education for teachers and students nationwide
- Integrates official junior high school Chinese language curriculum, including textbook passages, practice exercises, comprehension assessments, and recitation tasks
Focus Areas:
- Text Comprehension
- Text Memorization
Text Comprehension:
- Evaluate understanding of text structure, sentence organization, inference, vocabulary context, emotional tones
- Assesses ability to process and understand Chinese texts at junior high level
Text Memorization:
- Tests model's ability to recall and reproduce texts
- Essential for assessing long-term memory and linguistic retention in Chinese education
Data Preparation:
- Organize exercises by unit, maintaining core knowledge points and skill requirements
- Convert PDF format for model input
- Ensure integrity of data structure
- Clear framework for model processing
Prompt Design:
- Instruction-based learning method with structured prompts
- Provide clear task instructions for accurate outputs
- Optimize complex tasks, such as text comprehension and recitation
Expert Revision:
- Team of professional junior high school Chinese teachers reviews generated data
- Evaluates model's outputs for accuracy, consistency, alignment with educational standards
- Corrects errors to improve dataset quality
- Enhances dataset's applicability in educational settings
Text Comprehension Data Construction:
- Collect 613 text comprehension items through expert revision
- Improves overall dataset quality and scientific rigor
- Ensure alignment with junior high school Chinese education requirements
Text Memorization Data Construction:
- Create multiple-choice questions for students to assess specific content recall from the text
- Design questions based on original text without rewriting or modification
- Evaluate students' ability to remember content from recitation passages.
Personality Traits: Low Conscientiousness (LC)
Teacher's Expectations:
- Answering teacher's questions during class
- Exhibit traits of carelessness, inconsistency, and lack of systematic thinking
- Provide simple and direct answers that are occasionally self-corrected and unreliable
Strategy for Answering Questions:
- Focus on the question and identify the phase of the lesson (pre-lesson introduction, new lesson instruction, consolidation of new knowledge, classroom practice, or lesson summary)
- Understand the nature of the question - determine if it is open-ended or closed-ended
- If closed-ended, give a brief answer based on your knowledge
- If open-ended, answer according to your personality traits
Word Cloud Visualization of Big Five Personality Traits: LVSA Fine-Tuning Dataset Figure A10: Top 200 High Frequency Words (LC LVSA)
- The first 200 words are displayed, based on Chinese answers translated into English
Data Analysis: The Top 200 High Frequency Words (HA LVSA) The experiment is based on Chinese, with Chinese answers counted for word frequency and translated into English for display.
Conclusion:
- This section provides an analysis of the data from virtual students with low conscientiousness personality traits. The top 200 high frequency words are presented in a word cloud visualization and discussed.
Subjective Evaluation Dataset (SE Dataset)
- Table A3: Details of Subjective Evaluation Dataset
- Learning Stage:
- Big Five Type: HA, HE, LC, LO, HN
- Total: 16 questions per category for each type
- Pre-lesson Introduction:
- Closed Question: 0-3 (HA: 3, HE: 0, LC: 1, LO: 1, HN: 5)
- Open Question: 1-8 (HA: 2, HE: 3, LC: 1, LO: 4, HN: 1)
- New Lesson Instruction:
- Closed Question: 3-12 (HA: 3, HE: 3, LC: 1, LO: 2, HN: 5)
- Open Question: 2-8 (HA: 2, HE: 3, LC: 1, LO: 2, HN: 2)
- Knowledge Consolidation:
- Closed Question: 3-11 (HA: 3, HE: 3, LC: 1, LO: 2, HN: 5)
- Open Question: 5-14 (HA: 3, HE: 2, LC: 2, LO: 2, HN: 8)
- Classroom Exercises:
- Closed Question: 0-3 (HA: 0, HE: 1, LC: 1, LO: 1, HN: 3)
- Open Question: 5-14 (HA: 3, HE: 2, LC: 2, LO: 2, HN: 8)
- Lesson Summary:
- Closed Question: 0-2 (HA: 0, HE: 0, LC: 0, LO: 1, HN: 2)
- Open Question: 0-8 (HA: 0, HE: 3, LC: 0, LO: 3, HN: 5)
- Learning Stage:
- GPT-4-E Pre-lesson Introduction:
- Closed Question: 424-1816 (HA: 336, HE: 280, LC: 392, LO: 384)
- Open Question: 104-512 (HA: 144, HE: 96, LC: 72, LO: 96)
- New Lesson Instruction:
- Closed Question: 296-1528 (HA: 296, HE: 160, LC: 360, LO: 304)
- Open Question: 288-1080 (HA: 216, HE: 264, LC: 128, LO: 184)
- Knowledge Consolidation:
- Closed Question: 192-1064 (HA: 192, HE: 200, LC: 224, LO: 256)
- Open Question: 320-1296 (HA: 264, HE: 224, LC: 264, LO: 224)
- Classroom Exercises:
- Closed Question: 256-1176 (HA: 184, HE: 208, LC: 280, LO: 280)
- Open Question: 240-1208 (HA: 272, HE: 280, LC: 192, LO: 192)
- Lesson Summary:
- Closed Question: 80-520 (HA: 72, HE: 136, LC: 104, LO: 136, HN: 128)
- Open Question: 512-432 (HA: 432, HE: 360, LC: 376, LO: 352, HN: 2032)
- Number of Questions per Category:
- Pre-lesson Introduction: 100 (closed) and 200 (open) questions
- New Lesson Instruction: 200 (closed) and 400 (open) questions
- Knowledge Consolidation: 400 (closed) and 800 (open) questions
- Classroom Exercises: 300 (closed) and 1200 (open) questions
- Lesson Summary: 100 (closed) and 800 (open) questions
- Distribution of Subjective Evaluation Dataset: Refer to the chart (Figure A11) for details.
Model Fine-tuning Process
- Conducted using high-performance computing cluster with 8 A6000 GPUs for efficient training and stable performance
- SWIFT framework employed: reduces computational cost of fine-tuning large models through integration of Low-Rank Adaptation (LoRA) method
- Minimizes parameter updates with low-rank matrices
- Personalizes large models without extensive resources
- Hyperparameters fine-tuned for each personality type based on language characteristics and cognitive development
- Table A4 provides hyperparameter settings:
- Model: Fine-tuning parameters for High Extraversion (HE), High Neuroticism (HN), Low Openness (LO), High Agreeableness (HA), and Low Conscientiousness (LC) students.
Model Hyperparameters:
Personality | Hyperparameter |
---|---|
HE, HA | Learning rate: 1.0E-04 |
HE, HA | Num train epochs: 3 |
HN, LC | Learning rate: 1.0E-04 |
HN, LC | Num train epochs: 3 |
LO | Dropout: 0.2 |
LO | Learning rate: 5.0E-05 |
LO | Num train epochs: 2 |
All | LoRA alpha: 64 for HE, HN, and HA; 32 for LO |
Pre-Fine Tuning Student Responses:
Ode to the Yellow River
- HN Student: Gives examples of poet's praise for the heroic spirit of the Yellow River (long, continuous flow, nurtures generations, steadfast and unyielding)
- LLaVa Student: Depicts grandeur of Yellow River as a large body of water with vast hydropower potential
- MiniCPM Student: Speaks for the people and reflects societal suffering
- Qwen Student: Mother river of China, nurturing Chinese civilization
Memories of Mr. Lu Xun
- HN Student: Impressions of Mr. Lu Xun as a famous modern writer and thinker who advocated for vernacular language movement
- LLaVa Student: Daily scenes like sleeping, eating, and playing hide-and-seek with Mr. Lu Xun
- MiniCPM Student: Focuses on emotional bond between Xiao Hong and Mr. Lu Xun
Ode to the Lotus
- HE Student: Lotus grows in clear water and symbolizes purity, integrity
The Ballad of Mulan
- HE Student: Highlights Mulan's filial piety and bravery, symbolizing women's independent spirit
Importance of the Yellow River to China
- HA Student: Mother river of China, nurturing Chinese civilization, important water resource, significant transportation route
Natural Landscapes in Northeast China (Oath of the Land)
- LLVa Student: Mountains, forests, lakes
Praise for Lotus (Ode to the Lotus)
- Zhou Dunyi: Embodies qualities of purity and refusal to succumb to worldly corruption
Wen Yiduo's Unconventional Acceptance of Zang Kejia
- Wen Yiduo: Values talent, not just academic achievements
Post-Fine Tuning Student Responses:
Ode to the Yellow River
- HN Student: Mother river of China, symbolizes Chinese nation
- LLaVa Student: Clear water, significant natural resource and historical significance
- MiniCPM Student: Continuous flow, nurtures generations
- Qwen Student: Important role in cultural transmission and development
Oath of the Land (Northeast China)
- Natural landscapes: Large mountains, waters, grasslands
Wen Yiduo's Words and Deeds
- Values academic work and political activities seriously
Sun Quan Urges L¨u Meng to Study
- Encourages studying by explaining The Six Secret Teachings
Ah Chang and the Classic of Mountains and Seas
- Ah Chang buys book for Lu Xun due to shared interest in ancient Chinese culture
Deng Jiaxian's Contributions
- Great scientist who made significant contributions to China's nuclear weapons program
Impressions of Mr. Lu Xun (Xiao Hong's Recollections)
- Great literary figure, deeply influential on Chinese literature.
Human Turing Evaluation Questionnaire
Experiment Overview:
- Inspired by the classic Turing test
- Aimed to evaluate whether LVSA could emulate human language expression and cognitive levels
- Participants acted as "judges" to determine if generated dialogues resembled natural human student responses
Experimental Steps:
- Questionnaire completion: Participants judged whether student responses resembled real ones or not
- Think-aloud protocol: Participants verbalized their thought process during the questionnaire
- Semi-structured interviews: Participants elaborated on factors affecting their evaluations
Participant Recruitment:
- 35 participants were recruited, including pre-service and in-service teachers
- Participants had experience teaching junior high school Chinese and expressed willingness to participate
- Detailed information of 10 selected participants was provided in the table
Pre-survey of Evaluators:
- Participants were asked to provide personal information, teaching experience, and familiarity with junior high Chinese textbooks
- They also indicated their willingness to participate in think-aloud exercises and semi-structured interviews
Experiment Procedure:
- Experimenter provided detailed explanations and guidelines to ensure scientific rigor
- Participants were asked to judge 120 teacher-student dialogues, some involving virtual students
- Participants' verbalizations during the questionnaire were recorded for analysis
Evaluation Criteria:
- Based on the Big Five personality traits of students (Carefulness, Cooperation, Nervousness, Low Tolerance, and High Extraversion)
- Participants ranked student descriptions based on these traits
Think-aloud Protocol:
- Participants verbally explained their thought process while answering questions
- Recorded for analysis to determine the criteria used in distinguishing virtual from real student language performances
Semi-structured Interviews:
- Participants elaborated on factors affecting their evaluations
- Responses were coded for consistency and used to develop a scientific evaluation framework
Compensation:
- Participants received appropriate compensation and a small token of appreciation for their contributions.
Human Turing Evaluation Results
Table A6:
- Evaluator performances for distinguishing virtual students from real ones
- Columns: HE, HN, LO, HA, LC (High Extraversion, High Neuroticism, Low Openness, High Agreeableness, Low Conscientiousness)
- Rows: Evaluators' judgments before and after interaction with students
- Evaluator1 to Evaluator10 results
Evaluation Metric:
- Probability of identifying fine-tuned virtual students as real students
Findings:
- Average recognition probability for fine-tuned virtual students exceeds 0.9 across personality dimensions
- Some virtual students with specific traits (high neuroticism, low conscientiousness, low openness) become indistinguishable from real students
- Language generation for these virtual students is more convincing and human-like in teaching scenarios.
Fleiss's Kappa:
- Measure of inter-rater agreement used to evaluate participants' judgments of virtual versus real students
- A value between 0.6 and 0.8 indicates substantial agreement, suggesting high consensus among participants
- Experiment result showed a Fleiss’s Kappa value of 0.6917, indicating strong agreement in participants' judgments.
Impact:
- Virtual students based on large language models have significant potential to emulate human language expression effectively in teaching scenarios.
GPT-4 Large-Scale Evaluation Prompt
Coding of Interview Content:
- Two students coded evaluator interview transcripts using ATLAS.ti software
- Employed a two-level coding method to ensure thoroughness and accuracy
- Identified four primary codes and 15 secondary codes capturing multidimensional factors evaluators focused on when distinguishing virtual from real students
- Coding dimensions included:
- Cognitive Level: Complexity, Reasonableness, Logicality
- Psychological State: Suspicions, Interaction, Nervousness, Reflection
- Verbal Expression: Personalization, Sentence Structure, Oral Language, Fluency, Pronoun Usage, Length, Emotional Integration
Prompt Design:
- Coding dimensions were selected as key evaluation criteria in prompt design
- Adopted a "chain-of-thought" (CoT) reasoning approach to guide GPT-4's step-by-step assessment of virtual student responses
- Integrated coded evaluation dimensions into prompts to ensure consistent evaluation outcomes between GPT-4 and human evaluators
Comparison of Human Evaluation and GPT-4 Evaluation:
- Consistency between GPT-4 and human evaluators was quantified using Fleiss's Kappa coefficient
- GPT-4's average evaluation performance reached 0.978, demonstrating high alignment with human evaluators
- Evaluations for certain personality types exceeded human assessments, while others fell below
- Overall, the study found a "substantial agreement" (Fleiss's Kappa consistency coefficient = 0.6806) between GPT-4 and human evaluators
Bad Case Examples of LLMs' Performance
Limitations of LLMs in Role-Playing Contexts:
- Anomalies identified: limited improvement in low-conscientiousness personalities, absence of fine-tuning effects across all question types in some models, and relatively weaker performance of LLaVa compared to other models post-fine-tuning.
Low Conscientiousness Personalities:
- Poor fine-tuning performance: attributed to sparse distribution of relevant data in original training data
- Increased likelihood of "hallucination phenomena" where model struggles to maintain semantic coherence
- Complicating factors: negative traits linked to antisocial behavior or marginalized groups, and measures to avoid reinforcing such characteristics during training.
Comparison of Closed-Ended and Open-Ended Questions:
- Inconsistent fine-tuning effects across question types: no statistically significant differences for InternVL, Qwen, and MiniCPM (p-values > 0.05)
- Structural differences between questions: closed-ended require precise recall, open-ended involve reasoning, emotional expression, and creative thinking.
LLaVa Model Performance:
- Lower overall post-fine-tuning accuracy of around 67.85% compared to other models
- Likely due to pretraining predominantly on English data, limiting performance in Chinese-language applications
- Future research: incorporating more extensive Chinese-language pretraining or customized fine-tuning tailored to Chinese contexts.