Creating good math test questions is hard. It isn't enough to come up with a problem; the wrong answers (called "distractors") need to be carefully chosen to reflect common student mistakes, which makes the process tricky to automate with AI. A new research paper explores using Large Language Models (LLMs), like the kind that power ChatGPT, to help teachers write multiple-choice math questions. The researchers built a tool called HEDGE (Human Enhanced Distractor Generation Engine) in which an LLM generates the question stem, correct answer, and explanation, then suggests distractors, and a teacher reviews and edits everything to make sure it's accurate and relevant.

In a pilot study with math teachers, the LLM (GPT-4) did a decent job on the questions themselves: teachers found about 70% of them usable. The distractors were another story, with only about 37% deemed valid. This highlights a key limitation of current LLMs: they aren't good at modeling how students actually think and where they're likely to make mistakes. The AI could generate grammatically correct and mathematically sound questions, but it often missed the mark on plausible distractors. For example, it might offer an answer that's mathematically wrong, but not wrong in a way a student would actually arrive at. Teachers often had to step in and rewrite the distractors to reflect real student errors.

This research suggests that while AI can be a useful tool for automating parts of question creation, human expertise is still essential, especially when it comes to understanding student misconceptions. Future research could explore ways to improve LLMs' ability to generate distractors, perhaps by training them on datasets of student responses or by giving them access to a "bank" of common misconceptions. This could lead to more effective and efficient ways to create high-quality math assessments.
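As a rough illustration of what the generation step could look like, here is a minimal sketch assuming the OpenAI Python client; the prompt wording, JSON schema, and grade level are illustrative placeholders, not the paper's actual HEDGE prompts, and real output would need more robust parsing than shown.

```python
# Minimal sketch of the generation step: ask GPT-4 for a question stem,
# correct answer, explanation, and candidate distractors as JSON.
# Prompt text and JSON keys are illustrative assumptions, not HEDGE's.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """Write one multiple-choice algebra question for 8th grade.
Return JSON with keys: "stem", "correct_answer", "explanation",
and "distractors" (three wrong answers reflecting common student errors)."""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PROMPT}],
)
item = json.loads(response.choices[0].message.content)

# From here on, the teacher takes over: the stem and answer are usually
# fine, but the suggested distractors need careful review.
print(item["stem"], item["correct_answer"], item["distractors"], sep="\n")
```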
🍰 Interested in building your own agents?
PromptLayer provides the tools to manage and monitor prompts with your whole team. Get started for free.
Question & Answers
How does HEDGE (Human Enhanced Distractor Generation Engine) work in generating math test questions?
HEDGE is an AI-powered tool that uses Large Language Models (specifically GPT-4) to generate multiple components of math test questions. The process involves three main steps: First, the LLM generates the question stem, correct answer, and explanation. Second, it suggests potential distractors (wrong answer choices) based on mathematical principles. Finally, a human teacher reviews and edits all components to ensure accuracy and relevance. The system achieved about 70% usability for question stems but only 37% validity for distractors, demonstrating its current limitations in understanding genuine student misconceptions.
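The paper's summary doesn't spell out the review interface, but a minimal command-line version of the human-in-the-loop pass might look like the sketch below. The `question` dict matches the hypothetical JSON from the earlier generation sketch, and the validity tally mirrors how a figure like the reported 37% could be computed; everything here is an assumption for illustration.

```python
# Hypothetical teacher-review pass: keep, edit, or drop each suggested
# distractor, then report what fraction of LLM suggestions survived as-is.
def review_distractors(question: dict) -> dict:
    original = question["distractors"]
    reviewed, kept = [], 0
    print(f"Stem: {question['stem']}\nCorrect: {question['correct_answer']}")
    for d in original:
        choice = input(f"Distractor '{d}' -- [k]eep / [e]dit / [d]rop: ").strip().lower()
        if choice == "k":
            reviewed.append(d)
            kept += 1
        elif choice == "e":
            reviewed.append(input("Rewrite to reflect a real student error: "))
        # any other input drops the distractor entirely
    question["distractors"] = reviewed
    # Fraction of LLM-suggested distractors accepted unchanged -- the
    # quantity the study reports at roughly 37%.
    question["llm_distractor_validity"] = kept / len(original) if original else 0.0
    return question
```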
What are the benefits of using AI in educational assessment creation?
AI in educational assessment offers several key advantages. It can significantly reduce the time teachers spend creating basic test questions, allowing them to focus more on student interaction and specialized instruction. The technology can generate a large volume of practice problems quickly, enabling personalized learning paths for students. Additionally, AI can help maintain consistency in question formatting and difficulty levels across multiple assessments. However, as shown in current research, AI works best as a supportive tool rather than a complete replacement for human expertise in assessment creation.
How are AI-powered educational tools changing classroom assessment methods?
AI-powered educational tools are transforming classroom assessment by introducing more efficient and scalable methods. These tools can quickly generate practice problems, provide immediate feedback, and adapt to student performance levels. They help teachers save time on routine task creation while offering opportunities for more personalized learning experiences. However, research shows that human oversight remains crucial, especially in understanding student thinking patterns and common mistakes. This hybrid approach of AI assistance with teacher expertise represents the current best practice in educational assessment.
PromptLayer Features
Testing & Evaluation
The paper's methodology of evaluating AI-generated questions and distractors aligns with PromptLayer's testing capabilities
Implementation Details
Set up systematic A/B testing between different prompt versions for math question generation, track teacher approval rates, and implement regression testing to maintain quality
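As a sketch of that idea in plain Python (not PromptLayer's actual API), the comparison boils down to tracking teacher approvals per prompt version and failing a run when a candidate prompt regresses; the sample data and the 5-point threshold are placeholders.

```python
# Illustrative A/B comparison of two prompt versions, driven by teacher
# approval labels collected during review.
from collections import defaultdict

approvals = [  # (prompt_version, approved) pairs gathered from reviews
    ("v1", True), ("v1", False), ("v2", True), ("v2", True),
]

totals, wins = defaultdict(int), defaultdict(int)
for version, approved in approvals:
    totals[version] += 1
    wins[version] += approved

rates = {v: wins[v] / totals[v] for v in totals}
print("Approval rates:", rates)

# Simple regression guard: fail the run if the candidate prompt's approval
# rate drops more than 5 percentage points below the current baseline.
BASELINE, CANDIDATE = "v1", "v2"
assert rates[CANDIDATE] >= rates[BASELINE] - 0.05, "Prompt regression detected"
```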
Key Benefits
• Quantitative measurement of prompt effectiveness
• Systematic comparison of different prompt strategies
• Early detection of quality degradation
Potential Improvements
• Add specialized metrics for distractor quality
• Implement teacher feedback collection system
• Create benchmark datasets of verified questions
Business Value
Efficiency Gains
Reduce manual testing effort by 60-70% through automated evaluation pipelines
Cost Savings
Lower quality assurance costs by automating initial screening of generated questions
Quality Improvement
More consistent quality through standardized evaluation criteria
Workflow Management
The HEDGE system's human-in-the-loop review process maps to PromptLayer's workflow orchestration capabilities
Implementation Details
Create multi-step workflows for question generation, human review, and refinement with version tracking
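As a rough illustration of that generate, review, and refine loop with version tracking (a sketch under assumed names, not PromptLayer's workflow API), each question can be modeled as a small record that accumulates a history of edits:

```python
# Rough sketch of a versioned generate -> review -> refine record.
# The dataclass fields and step names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class QuestionRecord:
    stem: str
    correct_answer: str
    distractors: list
    version: int = 1
    history: list = field(default_factory=list)

    def revise(self, step: str, **changes) -> None:
        """Record which workflow step changed the item and bump the version."""
        self.history.append({"step": step, "version": self.version, "changes": changes})
        for key, value in changes.items():
            setattr(self, key, value)
        self.version += 1

item = QuestionRecord("Solve 2x + 3 = 11.", "x = 4", ["x = 7", "x = 5.5", "x = 8"])
item.revise("teacher_review", distractors=["x = 7", "x = 5.5", "x = 4.5"])
print(item.version, item.history[-1]["step"])  # -> 2 teacher_review
```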
Key Benefits
• Streamlined review process
• Version control of approved questions
• Reusable templates for different math topics
Potential Improvements
• Add automated quality checks
• Implement feedback loops for prompt refinement
• Create specialized templates for distractor generation
Business Value
Efficiency Gains
Reduce question creation time by 40% through structured workflows
Cost Savings
Optimize resource utilization by clearly defining human vs AI tasks
Quality Improvement
Better consistency through standardized review processes