MDE in the era of Generative AI

Ahmed ALAOUI MDAGHRI (1,2), Meriem OUEDERNI (2), Lotfi CHAARI (2)
1: Nantes University, 2: IRIT

Acknowledgment

This work is supported and funded by the IBCO-CIMI-CNRS research project (call 2021-2024).

Description

We introduce LLM4MDE, a novel approach that combines in-context learning and iterative prompting for program verification, ensuring that programs written in domain-specific languages (DSLs) are syntactically valid and incorporate environment constraints.

Main System Design

  1. Specific Prompt:

    Our solution starts with an input prompt holding two parameters:

    • The Natural Language Description (NLD) of a future DSL, including user requirements and semantic constraints
    • A document type definition of a specific modelling language (e.g., an Ecore definition file)

    We combine multiple prompting techniques to improve prompt quality: grammar prompting, Chain-of-Thought (CoT), and tool use, as sketched below.
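
A minimal sketch of how such a prompt could be assembled. The template wording, the build_prompt helper, and the Ecore.ecore path are illustrative assumptions, not the exact prompt used in LLM4MDE:

```python
# Hypothetical prompt assembly combining grammar prompting and CoT.
def build_prompt(nld: str, ecore_grammar: str) -> str:
    """Combine the DSL description (NLD) with the modelling-language grammar."""
    return (
        "You are a model-driven engineering assistant.\n"
        "Grammar (Ecore definition) the output MUST conform to:\n"
        f"{ecore_grammar}\n\n"
        "Task: derive an Ecore meta-model from the description below and "
        "serialize it as XMI.\n"
        f"Description: {nld}\n\n"
        # Chain-of-Thought: ask the model to reason before answering.
        "Think step by step: list concepts, attributes, and relationships "
        "before emitting the final XMI."
    )

nld = "A process description language with work definitions and ordering constraints."
prompt = build_prompt(nld, open("Ecore.ecore").read())  # grammar file path is an assumption
```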

  2. Output Generation:

    The LLM processes the input prompt and generates the serialized result in XMI format. With an Ecore meta-model as the target output, inference (see the sketch after this list):

    • Parses Ecore language markers
    • Extracts relevant concepts, entities, attributes, and relationships
    • Builds a structural model adhering to Ecore syntax
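
Continuing the sketch above, a hedged example of the generation call, assuming an OpenAI-compatible chat API (the model choice and client setup are assumptions):

```python
# Illustrative generation step using the OpenAI Python SDK (v1.x).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",                                  # assumed model choice
    messages=[{"role": "user", "content": prompt}],  # prompt from build_prompt()
    temperature=0,                                   # deterministic output helps validity
)
xmi_output = response.choices[0].message.content     # serialized meta-model (XMI)
```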
  3. Model Validation:

    The process involves:

    • Parsing with the domain grammar to check for syntactic errors
    • Human intervention for semantic checking if no syntactic errors are found
    • Iterative prompting with LLM agents to fix syntactic errors if any are present (a validation sketch follows)
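
A minimal sketch of the syntactic check, assuming pyecore as the Ecore parser; the check_syntax helper and the temporary file path are hypothetical:

```python
# Illustrative syntactic validation of the generated XMI with pyecore.
from pyecore.resources import ResourceSet, URI

def check_syntax(xmi_text: str, path: str = "candidate.ecore") -> list[str]:
    """Return a list of syntactic errors; an empty list means the model parses."""
    with open(path, "w") as f:
        f.write(xmi_text)
    try:
        ResourceSet().get_resource(URI(path))  # parsing enforces Ecore/XMI syntax
        return []  # valid: hand over to a human for semantic checking
    except Exception as err:  # pyecore raises on malformed input
        return [str(err)]  # feed these errors back into the iterative prompt

errors = check_syntax(xmi_output)
```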
  4. Database Storage for Use Cases:

    Validated outputs are stored in a database for further analysis and for fine-tuning the LLM, enabling iterative refinement from feedback and real-world scenarios (see the storage sketch below).
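
A minimal storage sketch, assuming a local SQLite table; the database name, schema, and column names are illustrative assumptions:

```python
# Illustrative storage of validated (NLD, XMI) pairs; schema is an assumption.
import sqlite3

conn = sqlite3.connect("llm4mde_usecases.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS validated_models ("
    "id INTEGER PRIMARY KEY, nld TEXT, xmi TEXT, model_name TEXT)"
)
conn.execute(
    "INSERT INTO validated_models (nld, xmi, model_name) VALUES (?, ?, ?)",
    (nld, xmi_output, "gpt-4o"),  # variables from the earlier sketches
)
conn.commit()
```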

  5. Ambiguity Resolution:

    For unresolved ambiguities or validation failures, we re-prompt the LLM (see the loop sketched below) with:

    • The errors that occurred
    • Additional description from the user, where available
    • Access to external tools (e.g., API search calls, documentation) to enhance understanding
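
Continuing the earlier sketches, a hedged version of the repair loop; the retry budget and repair wording are assumptions, and tool access is omitted for brevity:

```python
# Illustrative iterative-prompting loop reusing prompt, client, xmi_output,
# and check_syntax from the sketches above.
MAX_ROUNDS = 5  # assumed retry budget

for _ in range(MAX_ROUNDS):
    errors = check_syntax(xmi_output)
    if not errors:
        break  # syntactically valid: proceed to human semantic checking
    repair_prompt = (
        "The previous XMI output contained syntactic errors:\n"
        + "\n".join(errors)
        + "\nFix these errors and return the corrected XMI only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": xmi_output},
            {"role": "user", "content": repair_prompt},  # errors fed back
        ],
    )
    xmi_output = response.choices[0].message.content
```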
  6. Fine-tuning and Model Improvements:

    Once enough validated models have been gathered, the stored results are used to fine-tune our LLMs (see the export sketch below).
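
A minimal sketch of exporting the stored pairs to a JSONL training file; the chat-style record layout and file names are assumptions modeled on common fine-tuning APIs:

```python
# Illustrative export of stored (NLD, XMI) pairs to a JSONL fine-tuning file.
import json
import sqlite3

conn = sqlite3.connect("llm4mde_usecases.db")
with open("finetune.jsonl", "w") as out:
    for nld, xmi in conn.execute("SELECT nld, xmi FROM validated_models"):
        record = {"messages": [
            {"role": "user", "content": nld},        # description as input
            {"role": "assistant", "content": xmi},   # validated XMI as target
        ]}
        out.write(json.dumps(record) + "\n")
```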

Distribution of resolutions for one use case

The following table shows the distribution of use-case resolutions for each model. The statistics were collected from the resolution rate of a single use case (SimplePDL) over 40 iterations with our top-performing models.

| Model | Context provided | Correct output | N |
|---|---|---|---|
| GPT-4o | Zero-Shot | 0 | 40 |
| GPT-4o | Our prompt | 2 | 40 |
| GPT-4o | Iterative Prompting | 32 | 40 |
| meta-llama/Meta-Llama-3-70B-Instruct | Zero-Shot | 0 | 40 |
| meta-llama/Meta-Llama-3-70B-Instruct | Our prompt | 21 | 40 |
| meta-llama/Meta-Llama-3-70B-Instruct | Iterative Prompting | 30 | 40 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | Zero-Shot | 0 | 40 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | Our prompt | 8 | 40 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | Iterative Prompting | 18 | 40 |

Fine-Tuning Job

To improve our models' performance, we conducted a fine-tuning process on the stored validated results:
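
A hedged sketch of launching such a job, assuming the OpenAI fine-tuning API and the finetune.jsonl export from step 6; the base model name is an assumption:

```python
# Illustrative fine-tuning job launch using the OpenAI Python SDK (v1.x).
from openai import OpenAI

client = OpenAI()
training_file = client.files.create(
    file=open("finetune.jsonl", "rb"),  # JSONL export from the step-6 sketch
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumed base model; provider-dependent
)
print(job.id)  # poll the job until it reports success
```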

Demo Application