Deep Dive: Context Assembly

What you'll learn

How ContextQueryBuilder in SochDB assembles context for LLM requests under a token budget, using prioritized sections, the TOON output format, and mathematical analysis of the I/O reduction achieved.

The Context Problem

LLMs have a finite context window. A conversation might involve:

| Context Source | Typical Tokens | Priority |
|---|---|---|
| System prompt | 200–800 | Critical |
| Active skills prompt fragments | 100–500 each | High |
| Recent conversation history | 500–50,000 | High |
| Retrieved knowledge (RAG) | 200–5,000 | Medium |
| User profile / preferences | 50–200 | Medium |
| Tool schemas | 100–300 each | Conditional |
| Previous tool results | 50–2,000 | Low |

Total potential context: 60,000+ tokens. Typical model limit: 8,000–128,000 tokens. We need to fit the most important context within the budget.

This is formally a 0/1 knapsack problem.


Token Budgeting as Knapsack

Formal Definition

Given:

  • A set of context sections $S = \{s_1, s_2, \ldots, s_m\}$
  • Each section $s_i$ has a token cost $c_i$ and a value (priority) $v_i$
  • A total token budget $B$
  • A reserved allocation for the LLM's response: $R$

$$ \max \sum_{i=1}^{m} v_i \cdot x_i \quad \text{subject to} \quad \sum_{i=1}^{m} c_i \cdot x_i \leq B - R $$

Where $x_i \in \{0, 1\}$ indicates whether section $s_i$ is included.

ClawDesk's Approach

The full 0/1 knapsack is NP-hard, but ClawDesk uses a priority-tier greedy approach that runs in $O(m \log m)$.
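A condensed sketch of the tier-then-density greedy, using simplified illustrative types (not the real SochDB API):

```rust
// Simplified model of the greedy packer: sort by (priority tier, value
// density), then take sections while they fit the budget. Types here are
// illustrative assumptions, not the real ContextQueryBuilder internals.
#[derive(Clone)]
pub struct Section {
    pub tier: u8,      // higher = more important
    pub weight: f64,   // priority weight used for value density
    pub cost: usize,   // estimated token cost
}

pub fn pack(mut sections: Vec<Section>, budget: usize) -> Vec<Section> {
    // O(m log m): priority tier descending, ties broken by value density
    sections.sort_by(|a, b| {
        b.tier.cmp(&a.tier).then_with(|| {
            let da = a.weight / a.cost as f64;
            let db = b.weight / b.cost as f64;
            db.partial_cmp(&da).unwrap()
        })
    });

    // O(m): greedy packing
    let mut remaining = budget;
    let mut included = Vec::new();
    for s in sections {
        if s.cost <= remaining {
            remaining -= s.cost;
            included.push(s);
        } // sections that don't fit are skipped
    }
    included
}
```

The production `build()` shown later adds one refinement on top of this: shrinkable sections are trimmed to the remaining budget instead of being skipped outright.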


ContextQueryBuilder API

SochDB's ContextQueryBuilder provides a fluent API for assembling context:

```rust
/// In clawdesk-sochdb/src/context.rs
pub struct ContextQueryBuilder {
    budget: TokenBudget,
    sections: Vec<ContextSection>,
    output_format: OutputFormat,
}

impl ContextQueryBuilder {
    /// Create a new builder with the given token budget.
    pub fn new(budget: usize) -> Self {
        Self {
            budget: TokenBudget::new(budget),
            sections: Vec::new(),
            output_format: OutputFormat::Toon,
        }
    }

    /// Reserve tokens for the LLM's response.
    pub fn reserve_response(mut self, tokens: usize) -> Self {
        self.budget.reserve(tokens);
        self
    }

    /// Add a critical section (always included, panics if over budget).
    pub fn system_prompt(mut self, prompt: &str) -> Self {
        let section = ContextSection {
            content: prompt.to_string(),
            priority: Priority::Critical,
            token_cost: estimate_tokens(prompt),
            shrinkable: false,
        };
        self.budget.allocate_critical(section.token_cost);
        self.sections.push(section);
        self
    }

    /// Add skill prompt fragments.
    pub fn skills(mut self, skills: &[SelectedSkill]) -> Self {
        for skill in skills {
            self.sections.push(ContextSection {
                content: skill.prompt_fragment.clone(),
                priority: Priority::High,
                token_cost: skill.token_cost,
                shrinkable: false,
            });
        }
        self
    }

    /// Add conversation history (shrinkable — can be trimmed from oldest).
    pub fn history(mut self, entries: &[HistoryEntry]) -> Self {
        self.sections.push(ContextSection {
            content: format_history(entries),
            priority: Priority::High,
            token_cost: estimate_tokens_history(entries),
            shrinkable: true,
        });
        self
    }

    /// Add retrieved knowledge chunks.
    pub fn knowledge(mut self, chunks: &[KnowledgeChunk]) -> Self {
        for chunk in chunks {
            self.sections.push(ContextSection {
                content: chunk.content.clone(),
                priority: Priority::Medium,
                token_cost: chunk.token_count,
                shrinkable: false,
            });
        }
        self
    }

    /// Build the final context, fitting within the budget.
    pub fn build(self) -> Result<AssembledContext, ContextError> {
        let available = self.budget.available();
        // Count sections before `self.sections` is moved below.
        let total_sections = self.sections.len();
        let mut included = Vec::new();
        let mut remaining = available;

        // Sort by priority (descending), then by value density
        let mut sections = self.sections;
        sections.sort_by(|a, b| {
            b.priority.cmp(&a.priority).then_with(|| {
                let density_a = a.priority.weight() as f64 / a.token_cost as f64;
                let density_b = b.priority.weight() as f64 / b.token_cost as f64;
                density_b.partial_cmp(&density_a).unwrap()
            })
        });

        for section in sections {
            if section.token_cost <= remaining {
                remaining -= section.token_cost;
                included.push(section);
            } else if section.shrinkable && remaining > 0 {
                // Trim shrinkable sections to fit
                let trimmed = section.trim_to(remaining);
                remaining -= trimmed.token_cost;
                included.push(trimmed);
            }
            // else: skip non-shrinkable sections that don't fit
        }

        let assembled = match self.output_format {
            OutputFormat::Toon => format_toon(&included),
            OutputFormat::Raw => format_raw(&included),
        };

        Ok(AssembledContext {
            content: assembled,
            tokens_used: available - remaining,
            tokens_available: available,
            sections_included: included.len(),
            sections_skipped: total_sections - included.len(),
        })
    }
}
```

Usage Example

```rust
let context = ContextQueryBuilder::new(8192)
    .reserve_response(2048)
    .system_prompt("You are a helpful AI assistant.")
    .skills(&selected_skills)
    .history(&conversation_history)
    .knowledge(&rag_results)
    .build()?;

// context.content is now a token-budget-optimized string
// ready to be sent to the LLM
```

TOON Output Format

TOON (Token-Optimized Output Notation) is ClawDesk's compact serialization format for context sections. It uses 58–67% fewer tokens than verbose JSON or Markdown.

Comparison

Standard JSON format:

```json
{
  "conversation_history": [
    {
      "role": "user",
      "content": "What's the weather in London?",
      "timestamp": "2026-02-17T10:30:00Z"
    },
    {
      "role": "assistant",
      "content": "The current weather in London is 12°C with partly cloudy skies.",
      "timestamp": "2026-02-17T10:30:05Z"
    }
  ]
}
```

Token count: ~85 tokens

TOON format:

```text
[H]
U|10:30|What's the weather in London?
A|10:30|The current weather in London is 12°C with partly cloudy skies.
```

Token count: ~32 tokens

Savings: $\frac{85 - 32}{85} \approx 62\%$

TOON Encoding Rules

| Element | TOON Syntax | Example |
|---|---|---|
| Section header | `[X]` | `[H]` = History, `[S]` = System, `[K]` = Knowledge |
| User message | `U\|time\|text` | `U\|10:30\|Hello` |
| Assistant message | `A\|time\|text` | `A\|10:30\|Hi there` |
| Tool call | `T\|name\|args` | `T\|get_weather\|{"city":"London"}` |
| Tool result | `R\|name\|result` | `R\|get_weather\|12°C cloudy` |
| Knowledge chunk | `K\|source\|text` | `K\|wiki\|London is the capital...` |
| Metadata separator | `---` | Between sections |

Implementation

```rust
/// In clawdesk-sochdb/src/toon.rs
use std::fmt::Write; // needed for writeln! into a String

pub fn format_toon(sections: &[ContextSection]) -> String {
    let mut output = String::new();

    for (i, section) in sections.iter().enumerate() {
        if i > 0 {
            output.push_str("---\n");
        }

        match section.priority {
            Priority::Critical => {
                output.push_str("[S]\n");
                output.push_str(&section.content);
                output.push('\n');
            }
            Priority::High if section.is_history() => {
                output.push_str("[H]\n");
                for entry in section.as_history() {
                    let role = match entry.role {
                        Role::User => "U",
                        Role::Assistant => "A",
                        Role::System => "S",
                    };
                    let time = entry.timestamp.format("%H:%M");
                    writeln!(output, "{}|{}|{}", role, time, entry.content).ok();
                }
            }
            Priority::Medium if section.is_knowledge() => {
                output.push_str("[K]\n");
                output.push_str(&section.content);
                output.push('\n');
            }
            _ => {
                output.push_str(&section.content);
                output.push('\n');
            }
        }
    }

    output
}
```

I/O Reduction Analysis

Single Session Lookup

Consider a conversation with 100 messages stored in SochDB.

Naive approach:

  1. Query database for all 100 messages → 100 rows
  2. Serialize each to JSON → ~100 × 200 = 20,000 tokens
  3. Send all 20,000 tokens to LLM
  4. LLM processes 20,000 context tokens + generates response

ContextQueryBuilder approach:

  1. Query SochDB with conversation_id index → 1 indexed lookup (vs. full scan)
  2. Priority filter: keep only last 20 messages → 4,000 tokens raw
  3. TOON format: compress to ~1,500 tokens
  4. LLM processes 1,500 context tokens + generates response

I/O reduction:

$$ \text{Token reduction} = \frac{20{,}000 - 1{,}500}{20{,}000} = 92.5\% $$

Database I/O reduction:

For traditional key-value stores, fetching 100 messages requires 100 random reads. SochDB uses a single B-tree range scan:

$$ \text{I/O reduction} = \frac{100 \text{ random reads}}{1 \text{ range scan}} \approx 100\times \text{ fewer read operations} $$

(In page terms the gap is smaller: one range scan touches ~10 contiguous pages versus ~100 pages for the random reads, roughly 10×, and the random-read pattern also suffers far more cache misses.)
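A back-of-envelope page model makes the page counts above concrete. The numbers are assumptions from the text (one page per random point read, ~10 rows per B-tree leaf page), not SochDB measurements:

```rust
// Assumed: each random point read lands on a different page in the worst
// case; a range scan touches contiguous leaf pages holding ~10 rows each.
pub fn pages_random_reads(rows: u32) -> u32 {
    rows
}

pub fn pages_range_scan(rows: u32, rows_per_page: u32) -> u32 {
    rows.div_ceil(rows_per_page) // contiguous leaf pages touched by one scan
}
```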

Token Cost Savings Over Time

For an active conversation with $N$ total messages and a budget of $B$ tokens:

| Approach | Tokens Sent to LLM | Cost at \$3/M tokens (1,000 messages) |
|---|---|---|
| Send everything | $O(N)$ → 200,000 | \$0.60 |
| Sliding window (last 20) | $O(\min(N, 20))$ → 4,000 | \$0.012 |
| ContextQueryBuilder + TOON | $O(\min(N, B))$ → 1,500 | \$0.0045 |

Over 1,000 conversations/day:

$$ \text{Daily savings} = 1000 \times (\$0.60 - \$0.0045) = \$595.50/\text{day} $$
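The arithmetic can be sanity-checked in a few lines; the token counts and the US$3-per-million-tokens rate are the table's assumptions:

```rust
// Cost at an assumed $3 per million input tokens.
pub fn cost_usd(tokens: u64) -> f64 {
    tokens as f64 * 3.0 / 1_000_000.0
}

// Per-day savings of ContextQueryBuilder + TOON over sending everything,
// using the per-conversation token counts from the table above.
pub fn daily_savings(conversations_per_day: u64) -> f64 {
    let naive = cost_usd(200_000); // send everything
    let toon = cost_usd(1_500);    // ContextQueryBuilder + TOON
    conversations_per_day as f64 * (naive - toon)
}
```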


Priority-Based Assembly Algorithm

The complete algorithm: estimate each section's token cost, sort sections by priority tier (breaking ties by value density), pack greedily, and trim shrinkable sections to fill whatever budget remains.

Complexity Analysis

| Step | Time Complexity |
|---|---|
| Token estimation | $O(n)$ per section (character counting) |
| Priority sorting | $O(m \log m)$ where $m$ = section count |
| Greedy packing | $O(m)$ |
| TOON formatting | $O(T)$ where $T$ = total tokens |
| Total | $O(m \log m + T)$ |

For typical values ($m \leq 20$ sections, $T \leq 8192$ tokens), this completes in microseconds.


Comparison to Alternatives

Hand-Rolled Compaction

Many LLM applications use ad-hoc string truncation:

```python
# ❌ Typical hand-rolled approach
context = system_prompt
for msg in history[-20:]:  # arbitrary cutoff
    if len(context) + len(msg) < MAX_CHARS:  # character-based, not token-based
        context += msg
```

Problems:

  1. Character-based, not token-based (1 character ≠ 1 token)
  2. No priority weighting — recent history always wins
  3. No shrinkability — all or nothing per section
  4. No format optimization — raw text wastes tokens

LangChain's ConversationBufferWindowMemory

```python
memory = ConversationBufferWindowMemory(k=20)
```

Better, but still:

  1. Fixed window size, not budget-aware
  2. No priority among different context sources
  3. No output format optimization

ClawDesk's ContextQueryBuilder

| Feature | Hand-rolled | LangChain | ContextQueryBuilder |
|---|---|---|---|
| Token-based budgeting | ❌ | ⚠️ | ✅ |
| Priority tiers | ❌ | ❌ | ✅ |
| Shrinkable sections | ❌ | ❌ | ✅ |
| Output optimization (TOON) | ❌ | ❌ | ✅ |
| Type-safe builder API | ❌ | ❌ | ✅ |
| Formal knapsack model | ❌ | ❌ | ✅ |

Key Takeaways

  1. Context assembly is a knapsack problem — formalize it, don't ad-hoc it
  2. Priority tiers beat flat ordering — system prompts must never be cut
  3. TOON format saves 58–67% tokens by eliminating JSON boilerplate
  4. Shrinkable sections (history) gracefully degrade under budget pressure
  5. SochDB indexed lookups reduce database I/O by orders of magnitude

Further Reading