# Shallow Search: Evaluating DeepSeek R1 for Apache Error Log Analysis and Error Resolution

## Quickstart

Create a `.env` file at the project root directory to store the following API keys:

```
X_OPENAI_API_KEY="<OPENAI API KEY>"
X_GROQ_API_KEY="<GROQ API KEY>"
```
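
The scripts are expected to read these keys from the environment. As a minimal sketch of loading them (assuming `python-dotenv`; the project may load configuration differently):

```python
# Sketch: load API keys from .env using python-dotenv (an assumed dependency).
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env at the project root

openai_key = os.environ["X_OPENAI_API_KEY"]
groq_key = os.environ["X_GROQ_API_KEY"]
```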

Using pip:

```
pip install -r requirements.txt
python main.py <model>
```

where `<model>` can be:

- `gpt4o`
- `deepseek-70b-llama-groq`
- `llama-3-3-70b-groq`

## Abstract

The objective of this study was to evaluate the performance of large language models (LLMs) in generating synthetic data for descriptions and solutions in Apache error logs while using reinforcement learning methodologies. Specifically, we examined DeepSeek R1, a reinforcement learning-enhanced model distilled from Groq’s LLaMA 70B. The study employed cosine similarity to assess the correctness of generated outputs against a labeled dataset and incorporated n-gram analysis to refine prompts. We observed significant model biases in sequential text processing and a dependency between severity misclassification and increased loss in description/solution generation. The results highlight limitations in distinguishing between similar error categories and the impact of overfitting in prompt design.

Keywords: Large Language Models, DeepSeek R1, Apache Error Logs, Reinforcement Learning, Prompt Optimization, Cosine Similarity


## Introduction

Large Language Models (LLMs) have demonstrated strong capabilities in understanding and generating natural language, but their performance varies significantly based on task-specific nuances. In system administration and software diagnostics, automated log interpretation is critical for debugging and anomaly detection. However, log entries can be highly ambiguous, requiring contextual understanding beyond simple pattern matching. This study investigates the application of DeepSeek R1, a reinforcement learning-trained LLM, to generate accurate explanations for Apache error logs.
DeepSeek R1 incorporates reinforcement learning to improve reasoning, making it well suited to the ambiguity of log interpretation.
## Methodology

### Data Preprocessing and N-Gram Analysis

We utilized a Python-based approach to analyze actual logs and extract the most common n-grams using the Natural Language Toolkit (NLTK). Specifically, we identified consecutive n-grams ranging from unigrams (1-gram) to fourteen-grams (14-gram). These extracted n-grams were manually analyzed from a linguistic perspective and integrated into the prompts to ensure that the output accurately reflected the most frequently occurring patterns in the logs.
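
As a rough sketch of this step (the tokenization choices and example log lines are illustrative, not the project's exact script):

```python
# Sketch of n-gram frequency extraction with NLTK; example log lines are illustrative.
from collections import Counter
from nltk.util import ngrams
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

logs = [
    "mod_jk child workerEnv in error state 6",
    "jk2_init() Found child 6725 in scoreboard slot 10",
]

for n in range(1, 15):  # unigrams (1-gram) through fourteen-grams (14-gram)
    counts = Counter()
    for line in logs:
        counts.update(ngrams(word_tokenize(line.lower()), n))
    print(n, counts.most_common(5))
```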

### Model Distillation and Reinforcement Learning

DeepSeek R1, a reinforcement learning-enhanced model, was distilled from Groq’s LLaMA 70B. The distillation process involved training a smaller, more efficient model while transferring knowledge from the larger LLaMA 70B model. To enhance DeepSeek R1’s performance, we applied manual reinforcement learning by systematically reviewing its outputs and removing incorrect responses. This was necessary due to inconsistencies in how DeepSeek classified errors; for instance, it often misclassified actual errors as non-errors.
We observed that DeepSeek R1 struggled with short phrases, such as “Done...”, which negatively impacted classification accuracy. As a mitigation strategy, we employed few-shot prompting by providing each enumeration (enum) field with 4-5 guiding examples to improve response reliability.
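
As an illustration of the few-shot structure (the enum values and example log lines below are assumptions, not the project's verbatim prompt):

```python
# Illustrative few-shot block for a "severity" enum field; labels and examples are assumed.
SEVERITY_EXAMPLES = [
    ("jk2_init() Found child 6725 in scoreboard slot 10", "notice"),
    ("Done...", "notice"),
    ("mod_jk child workerEnv in error state 6", "error"),
    ("Directory index forbidden by rule", "error"),
    ("Can't find child 29722 in scoreboard", "error"),
]

def few_shot_block(examples):
    lines = [
        "Classify the severity of each Apache error log line.",
        "Allowed values: notice, warn, error.",
    ]
    for log_line, label in examples:
        lines.append(f'Log: "{log_line}" -> severity: {label}')
    return "\n".join(lines)

print(few_shot_block(SEVERITY_EXAMPLES))
```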

### Prompt Optimization and Performance Benchmarking

DeepSeek R1:1.5B (local Ollama) exhibited significantly worse performance on enum fields, with a loss increase of approximately 0.5 under the same prompt conditions. To address this, we experimented with minimizing the prompt content to provide a clearer reasoning direction.
We also modified our benchmarking approach to improve testing efficiency. This included the following changes (a condensed sketch follows the list):

- Implementing a try/except block for keyboard interrupts to allow batch testing.
- Adding a `max_retries` mechanism to handle formatting failures.
- Explicitly structuring the prompt with formatting instructions at both the start and end to ensure better model attention.
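
The sketch below shows one way these pieces can fit together; it is illustrative rather than the project's actual code, and `generate` and `parse` are stand-in callables for the model call and the output parser.

```python
# Illustrative batch-evaluation loop with keyboard-interrupt handling and a retry cap.
def evaluate_batch(samples, generate, parse, max_retries=3):
    results = []
    try:
        for sample in samples:
            for attempt in range(max_retries):
                raw = generate(sample)          # one model call per attempt
                try:
                    results.append(parse(raw))  # raises ValueError on malformed output
                    break
                except ValueError:
                    if attempt == max_retries - 1:
                        results.append(None)    # give up on this sample after max_retries
    except KeyboardInterrupt:
        print(f"Interrupted after {len(results)} samples; returning partial results.")
    return results
```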
Additionally, DeepSeek R1 showed a preference for information introduced earlier in the prompt. We leveraged this observation by prioritizing descriptions over solutions, which initially resulted in better performance. To mitigate this bias, we added contextual details to solutions, which narrowed the accuracy gap from ~0.25/0.3 to ~0.15/0.2.

### Overfitting Considerations

While some individual loss values were as low as ~0.05, general overfitting was minimal due to the structured nature of Apache Error Logs. The dataset was carefully curated to space out keywords appropriately, reducing overfitting risks while maintaining a high level of accuracy.

### Embedding-Based Similarity Measurement

To further refine error classification, we leveraged spaCy’s word vector representations to enhance similarity measurements between actual and synthetic error log descriptions. This involved encoding the dataset into embeddings and integrating them into DeepSeek R1 for more consistent outputs.
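
A minimal sketch of such a similarity check with spaCy (assuming a vector-bearing model such as `en_core_web_md` is installed; the example texts are illustrative):

```python
# Sketch of embedding-based similarity with spaCy; requires `python -m spacy download en_core_web_md`.
import spacy

nlp = spacy.load("en_core_web_md")  # md/lg models ship word vectors

actual = nlp("mod_jk child process entered an error state")
generated = nlp("the mod_jk worker child terminated in an error state")

print(f"similarity: {actual.similarity(generated):.3f}")  # cosine similarity of averaged vectors
```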

### Dataset Structuring and Compression

To facilitate structured learning, we converted the validated dataset into a structured format suitable for DeepSeek R1’s processing. We also explored two compression techniques for handling CSV data (a sketch of the SVD route follows this list):

- Singular Value Decomposition (SVD) for dimensionality reduction.
- LLMLingua prompt compression to reduce prompt length while preserving key semantic details.
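
For the SVD route, a minimal sketch might pair a TF-IDF representation of the CSV rows with scikit-learn's `TruncatedSVD`; the file and column names below are hypothetical, and TF-IDF is an assumed preprocessing step rather than the project's documented pipeline.

```python
# Sketch: reduce CSV log text to a low-rank representation with TruncatedSVD.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

df = pd.read_csv("apache_logs.csv")           # hypothetical file name
texts = df["message"].astype(str).tolist()    # hypothetical column name

tfidf = TfidfVectorizer().fit_transform(texts)       # sparse (n_rows, n_terms)
svd = TruncatedSVD(n_components=50, random_state=0)
reduced = svd.fit_transform(tfidf)                   # dense (n_rows, 50)

print(reduced.shape, svd.explained_variance_ratio_.sum())
```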

### Inference Optimization

For performance reasons, DeepSeek R1 was called only once per evaluation cycle. Given its long per-call processing time, re-evaluating outputs multiple times was computationally expensive and impractical. Instead, we optimized the prompt structure to maximize accuracy within a single inference pass.
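
As a rough sketch of what a single-pass call can look like through the Groq Python client (the model identifier, prompt wording, and expected output fields here are placeholders, not confirmed values from the repository):

```python
# Illustrative single inference call per log entry via the Groq SDK.
import os
from groq import Groq

client = Groq(api_key=os.environ["X_GROQ_API_KEY"])

response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",  # placeholder model id
    messages=[
        {"role": "system", "content": "Return error_type, severity, description, and solution as JSON."},
        {"role": "user", "content": 'Log: "mod_jk child workerEnv in error state 6"'},
    ],
)
print(response.choices[0].message.content)
```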

### Dataset Description

Dimensionality Reduction Attempts: Explored LLMLingua compression and SVD-based encoding.

## Results and Discussion

### Tentative Model Results

Results obtained after a single run on Sunday, February 16, 2025.

| Model Name | Error Type Loss | Severity Loss | Description Loss | Solution Loss |
|---|---|---|---|---|
| DeepSeek-70B-Llama (Groq) | 0.1875 | 0.0563 | 0.1300 | 0.1956 |
| Llama-3.3 (Groq) | 0.8500 | 0.9688 | 0.1453 | 0.2187 |
| GPT-4o | 0.3312 | 0.0437 | 0.1356 | 0.2353 |

For these losses, lower means better performance.
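
The abstract describes scoring generated outputs with cosine similarity; one plausible reading of these loss values (an assumption, not a formula confirmed by the repository) is a `1 - cosine similarity` distance between embeddings of the labeled and generated fields, averaged over the dataset.

```python
# Assumed loss definition: 1 - cosine similarity between labeled and generated field embeddings.
import numpy as np

def cosine_loss(actual_vec: np.ndarray, generated_vec: np.ndarray) -> float:
    cos = np.dot(actual_vec, generated_vec) / (
        np.linalg.norm(actual_vec) * np.linalg.norm(generated_vec)
    )
    return 1.0 - float(cos)  # 0.0 means identical direction; larger means worse agreement
```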

Run using:

```shell
uv run main.py deepseek-70b-llama-groq
uv run main.py llama-3-3-70b-groq
uv run main.py gpt4o
```

### Model Performance Metrics

#### DeepSeek R1 Distilled (LLaMA 70B) Performance

![DeepSeek 70B Performance](bench1.png)

#### DeepSeek R1 1.5B Performance

![DeepSeek 1.5B Performance](bench2.png)

Final validation techniques:

- removed overfit prompts
- no SVD compression
- no LLMLingua compression
- no n-gram analysis on validation dataset

### Sequential Processing Bias

DeepSeek R1 interprets text in order of appearance, leading to prioritization of early fields. By placing the "solutions" field earlier in the prompt, accuracy improved.

### Severity Misclassification Impact

Incorrect severity predictions correlated with increased loss in description/solution accuracy (~0.2-0.25).
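
One way to check this kind of dependency is to group per-row losses by whether the severity prediction was correct; the column names and numbers below are hypothetical and only illustrate the analysis, not the reported results.

```python
# Hypothetical check: compare description/solution loss for correct vs. incorrect severity predictions.
import pandas as pd

results = pd.DataFrame({
    "severity_correct": [True, False, True, False],
    "description_loss": [0.08, 0.27, 0.11, 0.24],
    "solution_loss":    [0.12, 0.31, 0.15, 0.28],
})

print(results.groupby("severity_correct")[["description_loss", "solution_loss"]].mean())
```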

### Error Type Distinctions

- Poor differentiation between `fatal` and `runtime`, though this had minimal impact on descriptions/solutions.
- `SSL support unavailable` classified as `notice`, aligning with dataset expectations and showing no adverse effect.
- `Done...` misclassified as `warn` instead of `notice`, negatively impacting generated descriptions/solutions.
- `Child init [number 1] [number 2]...` wrongly classified as an `error`, leading to degraded descriptions/solutions.
- `Can't find child 29722 in scoreboard...` consistently misclassified as `warn` instead of `error`, with no major impact.

### Overfitting and Prompt Refinement

- The model overfit responses for specific patterns (e.g., "Check the Apache configuration files" had lower scores than "Check the configuration of workerEnv").
- If predicted=`notice` and actual=`error`, solutions were significantly degraded.
- Injecting edge cases into prompts reduced logical inconsistencies in responses.

### Performance Comparisons

- DeepSeek R1 1.5B: average response time ~5s on an M3 Max chip (32GB RAM).
- DeepSeek R1 Distilled (LLaMA 70B API): average response time ~15s.
- Smaller prompts led to better performance in DeepSeek R1 1.5B, indicating a need for more explicit guidance in reasoning tasks.

### Dimensionality Reduction Failures

- LLMLingua and SVD-based encoding resulted in poor outputs, showing that compression techniques were ineffective for log analysis.
- spaCy-based embeddings were explored but found unnecessary given the structured nature of the dataset.

Future work should explore alternative reinforcement learning fine-tuning approaches.

## References
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
https://arxiv.org/abs/2402.03300


DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
https://arxiv.org/abs/2501.12948


LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
https://arxiv.org/abs/2310.05736

Distilling the Knowledge in a Neural Network
https://arxiv.org/abs/1503.02531