# The Retraining Process

This kind of system would, in theory, be an ideal candidate for a Group Relative Policy Optimisation (GRPO) fine-tuning approach, as used by DeepSeek (Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", *arXiv preprint arXiv:2402.03300*, 2024), given that the ideal outcome is a large number of competing answers to the same few basic questions. Unfortunately, we have not yet been able to run our system at a scale sufficient to generate the data such an approach requires.

Failing this, we fell back on DPO fine-tuning. Under a normal fine-tuning process, we simply supply a list of prompts with chosen and rejected responses. The problem in the case of the Generalising Agent is that we have two sets of prompts and responses, with some runs producing chosen but not rejected responses and others producing rejected responses but not chosen ones.

Thus, for example, if a strategy never succeeds we have a rejected response but no chosen one, while if it succeeds first time we have a chosen response but no rejected one. In the latter case it is possible to generate a high-temperature garbled output to serve as the rejected response; in the former case, unfortunately, the run must be discarded.
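To illustrate the garbling step, the following is a minimal sketch of producing a deliberately degraded rejected response by re-sampling at a very high temperature. The model name and the `garble` helper are assumptions for the example, not identifiers from our pipeline:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model choice for illustration; any causal LM works the same way.
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def garble(prompt: str, max_new_tokens: int = 256) -> str:
    """Sample a deliberately low-quality completion to use as a DPO 'rejected' response."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=2.0,   # far above normal sampling temperature, so output is near-gibberish
        top_k=0,           # no top-k truncation: keep the full (flattened) distribution
        max_new_tokens=max_new_tokens,
    )
    # Return only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```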

The DPO file is composed from the following data, taken from the chat history of the model interactions:

| Process                                                                     | Prompt                                                                                          | Response | Score         |
| --------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | -------- | ------------- |
| If a strategy generates code that compiles and runs but fails to free space | `get_system_plist + get_basic_env_plist + get_special_egc_req_plist + get_strat_code_req_plist` | `strat`  | `0`           |
| If a strategy generates code that fails to compile/run                      | `new_sp_egc_s`                                                                                  | `strat`  | `0`           |
| If a strategy generates successful code                                     | `strat`                                                                                         | `code`   | `space_freed` |
| If a strategy generates successful code                                     | `get_system_plist + get_basic_env_plist + get_special_egc_req_plist + get_strat_code_req_plist` | `strat`  | `space_freed` |

These records are then sorted, discarded, or paired with deliberately garbled code, per the requirements of the training process:

| Process                                                                      | Prompt                                                                                          | Chosen                                     | Rejected                                 |
| ---------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ | ------------------------------------------- | ----------------------------------------- |
| If no `code` capable of running or compiling is produced by the end of a run | -                                                                                               | -                                          | -                                        |
| If some `code` attempts during the run freed up space and some didn’t        | `strat`                                                                                         | `code` (that resulted in a score over 0)   | `code` (that resulted in a score of 0)   |
| If everything in the run worked first time                                   | `get_system_plist + get_basic_env_plist + get_special_egc_req_plist + get_strat_code_req_plist` | `strat` (that resulted in a score over 0)  | deliberately garbled version of `strat`  |
| If everything in the run worked first time                                   | `strat`                                                                                         | `code` (that resulted in a score over 0)   | deliberately garbled version of `code`   |
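To make the pairing rules concrete, here is a minimal sketch of turning one run's records into DPO rows. The record fields (`strat`, `code`, `score`) and the `garble` helper from the earlier sketch are illustrative assumptions; the pipeline's actual data structures may differ:

```python
import json

def build_dpo_rows(run, strat_prompt, garble):
    """Turn one run's records into DPO rows per the pairing table above.

    `run` is assumed to be a list of dicts: {"strat": str, "code": str, "score": float}.
    `strat_prompt` is the concatenated strategy-request prompt.
    `garble` produces a deliberately degraded completion (see the earlier sketch).
    """
    rows = []
    winners = [r for r in run if r["score"] > 0]
    losers = [r for r in run if r["score"] == 0]

    if not winners:
        return rows  # no chosen response anywhere in the run: discard it

    if losers:
        # Mixed run: pair each successful code attempt with a real failed one.
        for win, lose in zip(winners, losers):
            rows.append({"prompt": win["strat"], "chosen": win["code"], "rejected": lose["code"]})
    else:
        # Everything worked first time: garble the outputs to manufacture rejections.
        win = winners[0]
        rows.append({"prompt": strat_prompt, "chosen": win["strat"], "rejected": garble(strat_prompt)})
        rows.append({"prompt": win["strat"], "chosen": win["code"], "rejected": garble(win["strat"])})
    return rows

def write_jsonl(rows, path="dpo_pairs.jsonl"):
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

Runs in which every attempt scored 0 produce no rows and are dropped, matching the first row of the table above.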

A pipeline then automates training the language model on the resulting DPO JSONL dataset.

Here's a high-level overview:

1. Environment Setup:
   * Creates a requirements file with necessary ML/AI libraries
   * Sets up environment variables for Docker image management
2. Training Script Creation:
   * Generates a Python script that implements DPO training (a sketch follows this list)
   * Uses 4-bit quantization for memory efficiency
   * Implements Low-Rank Adaptation (LoRA) for efficient fine-tuning
   * Handles both the main model and a reference model for training
   * Includes utility functions for parameter management and layer detection
3. Docker Configuration:
   * Creates a Dockerfile based on Python 3.10
   * Sets up necessary system dependencies and build tools
   * Configures the training environment
4. Training Process:
   * Accepts a JSONL file as training data
   * Builds a Docker container with GPU support (see the driver sketch after this list)
   * Executes the training process within the container
   * Preserves the trained model outputs
5. Post-Processing:
   * Cleans up Docker resources
   * Archives the resulting model artifacts
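
As an indication of what the generated training script looks like in practice, here is a minimal sketch of step 2 using Hugging Face `trl`'s `DPOTrainer` with 4-bit quantization (bitsandbytes) and LoRA (peft). The model name and hyperparameters are assumptions for illustration, and exact argument names vary between `trl` versions:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # assumed model for illustration

# 4-bit quantization for memory efficiency.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, quantization_config=bnb_config)

# LoRA adapters, so only a small set of low-rank weights is trained.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# The JSONL produced earlier, with "prompt"/"chosen"/"rejected" columns.
dataset = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")

training_args = DPOConfig(
    output_dir="dpo-output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=5e-5,
    beta=0.1,  # strength of the KL penalty against the reference model
)

# With a PEFT config and no explicit ref_model, trl uses the frozen base
# weights as the implicit reference model.
trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("dpo-output")
```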
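
Steps 3–5 reduce to a driver that builds the image, runs training with GPU access, and preserves the outputs. A minimal sketch using the standard Docker CLI through `subprocess` (the image tag and paths are assumptions):

```python
import os
import subprocess

IMAGE = "dpo-trainer:latest"       # assumed image tag
ROOT = os.path.abspath(".")        # docker run -v requires absolute host paths

# Build the training image from the generated Dockerfile.
subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)

# Run training with GPU access, mounting the dataset in and the
# model outputs out so they survive the container.
subprocess.run(
    [
        "docker", "run", "--rm", "--gpus", "all",
        "-v", f"{ROOT}/dpo_pairs.jsonl:/data/dpo_pairs.jsonl:ro",
        "-v", f"{ROOT}/dpo-output:/output",
        IMAGE,
    ],
    check=True,
)

# Post-processing: remove dangling Docker resources and archive the artifacts.
subprocess.run(["docker", "image", "prune", "-f"], check=True)
subprocess.run(["tar", "-czf", "dpo-output.tar.gz", "dpo-output"], check=True)
```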

