LLMs in Education: Novel Perspectives, Challenges, and Opportunities

by Bashar Alhafni, Sowmya Vajjala, Stefano Bannò, Kaushal Kumar Maurya, Ekaterina Kochmar. https://arxiv.org/html/2409.11917v1

Contents

LLMs in Education

Abstract

Overview:

  • Role of LLMs in education is an increasing area of interest
  • Offers new opportunities for teaching, learning, and assessment
  • This tutorial provides an overview of educational applications of NLP
    • Impact of recent advances in LLMs on this field

Discussion Topics:

  1. Key challenges and opportunities presented by LLMs in education
  2. Four major educational applications:
    • Reading skills
    • Writing skills
    • Speaking skills
    • Intelligent tutoring systems (ITS)

Audience:

  • Researchers and practitioners interested in the educational applications of NLP
  • The first tutorial to address this timely topic at COLING 2025.

1 Introduction

Large Language Models (LLMs)

  • GPT-3.5 (ChatGPT): remarkable capabilities across various tasks Wei et al. (2022); Minaee et al. (2024)
  • Rapid adoption by EdTech companies: Duolingo Naismith et al. (2023), Grammarly Raheja et al. (2023)
  • Impact on educational applications research and development: enabling new opportunities in writing assistance, personalization, and interactive teaching and learning, among others

Ethical Considerations in Integrating LLMs into Educational Settings

  • Paradigm shift in educational applications
  • Present novel challenges regarding ethical considerations Bommasani et al. (2021)

Key Topics Covered

  • Examining the challenges and opportunities presented by LLMs for educational applications through four key tasks:
    • Reading skills
    • Writing skills
    • Speaking skills
    • Intelligent tutoring systems (ITS)

2 Outline

  • Impact of LLMs on:
    • Writing assistance
    • Reading assistance
    • Spoken language learning and assessment
    • Development of Intelligent Tutoring Systems (ITS)

2.2 LLMs for Writing Assistance

Grammatical Error Correction (GEC) Tutorial

Overview:

  • Focuses on GEC for writing assistance
  • Automatically detects and corrects errors in text
  • Provides pedagogical benefits to L1 and L2 language teachers and learners: instant feedback, personalized learning
  • Covers history, popular datasets (Yannakoudakis et al., 2011; Napoles et al., 2017; Náplava et al., 2022), evaluation methods (Bryant et al., 2023), and techniques: rule-based to sequence-to-sequence models
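The evaluation methods above typically compare system output to references at the level of individual edits. As a simplified illustration of the idea (not the ERRANT toolkit itself, which also classifies error types), token-level edits can be extracted by aligning the source and corrected sentences:

```python
from difflib import SequenceMatcher

def extract_edits(source: str, corrected: str):
    """Align source and corrected token sequences and return
    (start, end, replacement) edits over source token spans.
    A simplified stand-in for alignment tools such as ERRANT."""
    src, cor = source.split(), corrected.split()
    edits = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, src, cor).get_opcodes():
        if tag != "equal":  # covers insertions, deletions, and replacements
            edits.append((i1, i2, " ".join(cor[j1:j2])))
    return edits

# One substitution: source tokens 1..2 ("go") replaced by "went"
edits = extract_edits("She go to school yesterday", "She went to school yesterday")
```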

LLMs for GEC:

  • Thorough overview of using Large Language Models (LLMs) for GEC
  • Evaluates their performance in terms of fluency, coherence, and fixing various error types across popular benchmarks (Fang et al., 2023; Raheja et al., 2024; Katinskaia and Yangarber, 2024)
  • Discusses prompting techniques and strategies for GEC, evaluating their effectiveness and limitations (Loem et al., 2023)
  • Compares LLMs to supervised GEC approaches, examining strengths and weaknesses (Omelianchuk et al., 2024)
  • Discusses using LLMs to evaluate GEC systems (Kobayashi et al., 2024)
  • Provides insights into future directions: evaluation, usability, interpretability from a user-centric perspective.
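When comparing LLMs to supervised GEC systems on the benchmarks above, scores are conventionally reported as edit-level F0.5, which weights precision twice as heavily as recall so that unnecessary corrections are penalized more than missed ones. A minimal sketch of the metric (edits represented as hashable tuples is an illustrative choice):

```python
def f_beta(hyp_edits: set, gold_edits: set, beta: float = 0.5) -> float:
    """Edit-level F_beta; beta=0.5 (the GEC standard) favors precision."""
    tp = len(hyp_edits & gold_edits)  # edits the system got exactly right
    if tp == 0:
        return 0.0
    precision = tp / len(hyp_edits)
    recall = tp / len(gold_edits)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

hyp = {(1, 2, "went"), (3, 4, "the")}   # one correct edit, one spurious
gold = {(1, 2, "went")}
score = f_beta(hyp, gold)  # P=0.5, R=1.0 -> F0.5 = 0.625/1.125 ≈ 0.556
```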

2.3 LLMs for Reading Assistance

  • Reading is a core component of literacy; the ability to read and comprehend text plays a major role in critical thinking and effective communication
  • Over two decades of NLP research on:
    • Readability assessment
    • Text simplification
  • Focused on improving accessibility to written content

Approaches:

  • A range of methods explored over time, from traditional readability formulas and feature-based classifiers to neural and transformer-based models
  • Key challenges include maintaining context and preserving meaning while making adjustments

Domain Specific Issues and Multilingual Approaches:

  • Research addressing domain-specific issues Garimella et al. (2021)
    • Medical terminology, technical jargon
  • Advancements in multilingual text simplification Saggion et al. (2022)

LLMs for Reading Support:

  • Arrival of LLMs led to new advances in readability assessment and text simplification Kew et al. (2023)
  • Zero-shot usage, prompt tuning, fine-tuning Lee and Lee (2023); Tran et al. (2024)
    • Improving accuracy and relevance to individual users

Current Limitations:

  • Limitations of current approaches Štajner (2021)
    • Difficulty in preserving context and meaning when making adjustments
  • Focus on user-based evaluation Vajjala and Lučić (2019); Säuberli et al. (2024); Agrawal and Carpuat (2024)
    • Ensuring that simplifications are meaningful to the intended audience

Future Directions:

  • Increased focus on user-based evaluation
  • Supporting more languages Shardlow et al. (2024)
  • Exploring new techniques for preserving context and meaning while simplifying text.
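The traditional readability formulas that predate the NLP approaches covered here combine simple surface statistics. As a concrete example, the Flesch-Kincaid grade level can be sketched as follows; the syllable counter is a crude heuristic of my own, not a component of any cited system:

```python
import re

VOWEL_GROUPS = re.compile(r"[aeiouy]+")

def count_syllables(word: str) -> int:
    """Crude heuristic: count vowel groups, discounting a silent final 'e'."""
    word = word.lower()
    groups = len(VOWEL_GROUPS.findall(word))
    if word.endswith("e") and groups > 1:
        groups -= 1
    return max(1, groups)

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentence) + 11.8 * (syllables / word) - 15.59"""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59
```

Such formulas ignore meaning entirely, which is precisely the gap that feature-based and LLM-based readability models aim to close.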

2.4 LLMs for Spoken Language Learning and Assessment

Speaking as a Crucial Language Skill

  • Speaking is a core language skill in language education curricula (Fulcher, 2000)
  • Increasing interest in automated spoken language proficiency assessment

Overview of Automated Spoken Language Assessment and Writing Assessment

  • History: overview of approaches to speaking assessment and their written counterpart, automatic essay scoring (Burstein, 2002; Rudner et al., 2006; Landauer et al., 2002)
  • Breakthroughs: first commercial systems for speech assessment (Townshend et al., 1998; Xi et al., 2008) and deep neural network approaches for writing (Alikaniotis et al., 2016) and speaking (Malinin et al., 2017) assessment
  • Applications: analytic assessment, grammatical error detection (GED), and grammatical error correction (GEC)

Focus on LLMs for Spoken Language Assessment and Feedback

  • Use of text-based foundation models like BERT (Devlin et al., 2019) for holistic assessment (Craighead et al., 2020; Raina et al., 2020; Wang et al., 2021) and analytic assessment approaches (Bannò et al., 2024b)
  • Use of speech foundation models like wav2vec 2.0 (Baevski et al., 2020), HuBERT (Hsu et al., 2021), and Whisper (Radford et al., 2022) for mispronunciation detection, pronunciation assessment (Kim et al., 2022), and holistic assessment (Bannò and Matassoni, 2023; Bannò et al., 2023)
  • Addressing interpretability issues using analytic assessment approaches for writing
  • Exploring opportunities for multimodal models like SALMONN (Tang et al., 2024) and Qwen Audio (Chu et al., 2023), as well as text-to-speech models like Bark (Schumacher et al., 2023) and Voicecraft (Peng et al., 2024) for assessment and learning.
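Speech foundation models such as wav2vec 2.0 emit a sequence of frame-level embeddings, and a common recipe for holistic proficiency scoring is to pool those frames over time and regress a score from the pooled vector. The pattern can be sketched in plain Python; the dimensions, weights, and score scale below are purely illustrative, not taken from any cited system:

```python
def mean_pool(frames):
    """Average frame-level embeddings (T x D nested lists) into one
    utterance-level vector of length D."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def predict_score(vec, weights, bias):
    """Linear regression head mapping the pooled vector to a holistic score.
    In practice this head is trained on human-rated proficiency data."""
    return sum(x * w for x, w in zip(vec, weights)) + bias

frames = [[1.0, 2.0], [3.0, 4.0]]          # 2 frames, embedding dim 2
pooled = mean_pool(frames)                  # [2.0, 3.0]
rating = predict_score(pooled, weights=[0.5, 0.5], bias=1.0)
```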

2.5 LLMs in Intelligent Tutoring Systems (ITS)

Overview:

  • Computerized learning environments that provide personalized feedback based on learning progress
  • Capable of providing one-on-one tutoring for equitable and effective learning experience
  • Leads to substantial learning gains

Lack of Individualized Tutoring:

  • The absence of one-on-one tutoring in large classrooms leads to less effective learning and student dissatisfaction

Key Principles of Learning Sciences:

  • Goals for ITS development
  • Outline: pre-LLM intelligent tutoring systems, including those tailored for misconception identification, model-tracing tutors, constraint-based models, Bayesian network models, and systems designed for specific knowledge areas

LLMs in Intelligent Tutoring Systems Development:

  • Question solving
  • Error correction
  • Confusion resolution
  • Question generation
  • Content creation
  • Simulating student interactions for teaching assistants and teacher training
  • Assisting in creating datasets for fine-tuning LLMs
  • Developing prompt-based techniques or modularized prompting for ITS development
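The modularized-prompting idea mentioned above can be sketched as assembling a tutor prompt from separate pedagogical modules, each owning one concern; the module names and fields below are illustrative, not from the tutorial:

```python
def build_tutor_prompt(persona: str, pedagogy: str,
                       student_state: dict, question: str) -> str:
    """Modularized prompting sketch: each pedagogical concern lives in its
    own module, then all are concatenated into one LLM instruction."""
    modules = [
        f"Role: {persona}",
        f"Strategy: {pedagogy}",
        f"Student progress: mastered={student_state['mastered']}, "
        f"struggling_with={student_state['struggling_with']}",
        f"Student question: {question}",
        "Respond with a hint, not the full solution.",
    ]
    return "\n".join(modules)

prompt = build_tutor_prompt(
    persona="patient math tutor",
    pedagogy="Socratic questioning",
    student_state={"mastered": ["fractions"], "struggling_with": ["ratios"]},
    question="How do I convert 3/4 to a ratio?",
)
```

Keeping each concern in its own module makes it easy to swap the pedagogy or update the student model without rewriting the whole prompt.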

Future Directions:

  1. Development of standardized evaluation benchmarks to assess progress in ITS
  2. Collection and creation of large public educational datasets for LLM training and fine-tuning
  3. Development of specialized foundational LLMs for educational purposes
  4. Investigations into long-term impact on students and teachers
  5. Examining ethical considerations, potential biases, and pedagogical value in dialogue-based ITS.

3 Recommended Reading List

  • Relevant papers cited in this proposal
  • Available on the tutorial website

4 Target Audience

  • Graduate students, researchers, and practitioners attending COLING 2025
  • Background in Computational Linguistics (CL), NLP, or Machine Learning (ML)
  • Interested in educational applications of NLP and generative AI
  • Basic knowledge of educational technologies not required

5 Tutorial Description

  • Self-contained and accessible to a wide audience
  • Focuses on advanced technologies for various educational applications
  • Covers recent advances brought by generative AI and LLMs
  • Addresses opportunities, challenges, and risks in the field
  • First tutorial on this topic at COLING or any other CL conference

6 Diversity Considerations

  • Tutorial covers a wide range of applications
  • Highlights opportunities to reach underrepresented groups
  • Addresses fairness and accessibility challenges
  • Instructors are diverse in gender, nationality, affiliation, and seniority
  • Includes open Q&A sessions for participant engagement and discussion.

Tutorial Reading List

COLING 2025 Tutorial Additional Reading List

Datasets

  1. The CoNLL-2014 Shared Task on Grammatical Error Correction
  2. The First QALB Shared Task on Automatic Text Correction for Arabic
  3. The Second QALB Shared Task on Automatic Text Correction for Arabic
  4. The BEA-2019 Shared Task on Grammatical Error Correction
  5. Grammar Error Correction in Morphologically Rich Languages: The Case of Russian
  6. Construction of an Evaluation Corpus for Grammatical Error Correction for Learners of Japanese as a Second Language

Evaluation methods and their reliability

  1. Ground Truth for Grammatical Error Correction Metrics
  2. There's No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction
  3. Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction
  4. Reference-based Metrics can be Replaced with Reference-less Metrics in Evaluating Grammatical Error Correction Systems
  5. Classifying Syntactic Errors in Learner Language
  6. IMPARA: Impact-Based Metric for GEC Using Parallel Data
  7. Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality
  8. Inherent Biases in Reference-based Evaluation for Grammatical Error Correction

Methods: Statistical and rule-based

  1. Detection of Grammatical Errors Involving Prepositions
  2. The Ups and Downs of Preposition Error Detection in ESL Writing
  3. Grammatical Error Correction with Alternating Structure Optimization
  4. Joint Learning and Inference for Grammatical Error Correction
  5. Generalized Character-Level Spelling Error Correction
  6. Grammatical error correction using hybrid systems and type filtering
  7. The AMU System in the CoNLL-2014 Shared Task: Grammatical Error Correction by Data-Intensive and Feature-Rich Statistical Machine Translation
  8. Phrase-based Machine Translation is State-of-the-Art for Automatic Grammatical Error Correction

Methods: sequence-to-sequence

  1. Grammatical error correction using neural machine translation
  2. Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task
  3. Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models
  4. Neural and FST-based approaches to grammatical error correction
  5. Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data
  6. Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data
  7. Stronger Baselines for Grammatical Error Correction Using a Pretrained Encoder-Decoder Model
  8. Document-level grammatical error correction

Methods: text-editing neural models

  1. Parallel Iterative Edit Models for Local Sequence Transduction
  2. Encode, Tag, Realize: High-Precision Text Editing
  3. Seq2Edits: Sequence Transduction Using Span-level Edit Operations
  4. FELIX: Flexible Text Editing Through Tagging and Insertion
  5. Character Transformations for Non-Autoregressive GEC Tagging
  6. EdiT5: Semi-Autoregressive Text Editing with T5 Warm-Start
  7. An Extended Sequence Tagging Vocabulary for Grammatical Error Correction

LLMs for GEC

  1. Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation
  2. Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error Correction
  3. ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark
  4. Prompting open-source and commercial language models for grammatical error correction of English learner text
  5. MEDIT: Multilingual Text Editing via Instruction Tuning
  6. GPT-3.5 for Grammatical Error Correction
  7. Exploring Effectiveness of GPT-3 in Grammatical Error Correction: A Study on Performance and Controllability in Prompt-Based Methods
  8. Pillars of Grammatical Error Correction: Comprehensive Inspection Of Contemporary Approaches In The Era of Large Language Models

LLMs as GEC Evaluators / Explainers

  1. Large Language Models Are State-of-the-Art Evaluator for Grammatical Error Correction
  2. GMEG-EXP: A Dataset of Human- and LLM-Generated Explanations of Grammatical and Fluency Edits
  3. Controlled Generation with Prompt Insertion for Natural Language Explanations in Grammatical Error Correction

Recent papers

  1. Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond
  2. Read, Revise, Repeat: A System Demonstration for Human-in-the-loop Iterative Text Revision
  3. Improving Iterative Text Revision by Learning Where to Edit from Other Revision Tasks
  4. Understanding Iterative Revision from Human-Written Text

Ethical considerations

  1. Unraveling Downstream Gender Bias from Large Language Models: A Study on AI Educational Writing Assistance

Readability and Simplification

Surveys

  1. Computational assessment of text readability: A survey of current and future research
  2. Trends, limitations and open challenges in automatic readability assessment research
  3. A survey of research on text simplification
  4. Data-Driven Sentence Simplification: Survey and Benchmark

Shared Tasks

Methods

  1. The Principles of Readability
  2. Do NLP and machine learning improve traditional readability formulas?
  3. Multiattentive Recurrent Neural Network Architecture for Multilingual Readability Assessment
  4. Text readability assessment for second language learners
  5. Exploring hybrid approaches to readability: experiments on the complementarity between linguistic features and transformers
  6. Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features
  7. Automatic induction of rules for text simplification
  8. Learning to simplify sentences with quasi-synchronous grammar and integer programming
  9. Optimizing statistical machine translation for text simplification
  10. Learning to Paraphrase Sentences to Different Complexity Levels
  11. Elaborative Simplification for German-language Texts
  12. Supervised and Unsupervised Neural Approaches to Text Readability
  13. All Mixed Up? Finding the Optimal Feature Set for General Readability Prediction and Its Application to English and Dutch

LLMs

  1. Prompt-based Learning for Text Readability Assessment
  2. FPT: Feature Prompt Tuning for Few-shot Readability Assessment
  3. Beyond Flesch-Kincaid: Prompt-based Metrics Improve Difficulty Classification of Educational Texts
  4. BLESS: Benchmarking Large Language Models on Sentence Simplification
  5. An LLM-Enhanced Adversarial Editing System for Lexical Simplification
  6. On Simplification of Discharge Summaries in Serbian: Facing the Challenges

Evaluation

  1. Towards grounding computational linguistic approaches to readability: Modeling reader-text interaction for easy and difficult texts
  2. Are Cohesive Features Relevant for Text Readability Evaluation?
  3. On understanding the relation between expert annotations of text readability and target reader comprehension
  4. Linguistic Corpus Annotation for Automatic Text Simplification Evaluation
  5. The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification
  6. Investigating Text Simplification Evaluation
  7. Do Text Simplification Systems Preserve Meaning? A Human Evaluation via Reading Comprehension

Explainability:

  1. Explainable AI in Language Learning: Linking Empirical Evidence and Theoretical Concepts in Proficiency and Readability Modeling of Portuguese
  2. “Geen makkie”: Interpretable Classification and Simplification of Dutch Text Complexity

Broader impact/Ethical issues

  1. Automatic text simplification for social good: Progress and challenges
  2. When readability meets computational linguistics: a new paradigm in readability
  3. Problems in Current Text Simplification Research: New Data Can Help

Spoken Language Learning and Assessment

Automated Speaking Assessment

  1. The use of DBN-HMMs for mispronunciation detection and diagnosis in L2 English to support computer-aided pronunciation training
  2. Improvements to an Automated Content Scoring System for Spoken CALL Responses: the ETS Submission to the Second Spoken CALL Shared Task
  3. Automated Speaking Assessment: Using Language Technologies to Score Spontaneous Speech
  4. Incorporating uncertainty into deep learning for spoken language assessment
  5. Automated scoring of spontaneous speech from young learners of English using transformers
  6. Automatic pronunciation assessment using self-supervised speech representation learning
  7. View-Specific Assessment of L2 Spoken English
  8. Proficiency assessment of L2 spoken English using wav2vec 2.0
  9. Can GPT-4 do L2 analytic assessment?

Spoken GED and GEC

  1. Automatic error detection in the Japanese learners’ English spoken data
  2. Impact of ASR Performance on Spoken Grammatical Error Detection
  3. Automatic Grammatical Error Detection of Non-Native Spoken Learner English
  4. On Assessing and Developing Spoken ‘Grammatical Error Correction’ Systems
  5. Towards End-to-End Spoken Grammatical Error Correction

LLMs for Speaking Assessment and Feedback

  1. Transformer Based End-to-End Mispronunciation Detection and Diagnosis
  2. Explore wav2vec 2.0 for Mispronunciation Detection
  3. Automatic Assessment of Conversational Speaking Tests

Validity and reliability

  1. Assessing L2 English speaking using automated scoring technology: examining automarker reliability

LLM in STEM Education and ITS

People have valued and thought deeply about education for a long time - John Dewey, 1923

Overview: Opportunities and Challenges

  1. Opportunities and Challenges in Neural Dialog Tutoring
  2. Are We There Yet? - A Systematic Literature Review on Chatbots in Education
  3. A Systematic Literature Review of Intelligent Tutoring Systems With Dialogue in Natural Language

Pre-LLM ITS

  1. AutoTutor 3-D Simulations: Analyzing Users' Actions and Learning Trends
  2. Gaze tutor: A gaze-reactive intelligent tutoring system
  3. Interactive Conceptual Tutoring in Atlas-Andes
  4. Individualizing self-explanation support for ill-defined tasks in constraint-based tutors
  5. Jacob-An animated instruction agent in virtual reality
  6. Data mining in education

LLM in STEM Education and ITS

  1. Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors
  2. Backtracing: Retrieving the Cause of the Query
  3. MATHDIAL: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems
  4. Improving Teachers’ Questioning Quality through Automated Feedback: A Mixed-Methods Randomized Controlled Trial in Brick-and-Mortar Classrooms
  5. Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes
  6. GPTeach: Interactive TA Training with GPT-based Students
  7. NAISTeacher: A Prompt and Rerank Approach to Generating Teacher Utterances in Educational Dialogues
  8. Is ChatGPT a Good Teacher Coach? Measuring Zero-Shot Performance For Scoring and Providing Actionable Insights on Classroom Instruction
  9. Demographic predictors of students’ science participation over the age of 16: An Australian case study

Evaluation of ITS

  1. The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues
  2. Measuring Conversational Uptake: A Case Study on Student-Teacher Interactions
  3. Evaluation Methodologies for Intelligent Tutoring Systems

Ethical Considerations

  1. What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in education