2026.04.14

What I Learned Building KoreShield: a Security Layer for Software That Uses LLMs

Let me share my thought process building KoreShield, a security proxy that detects and blocks prompt injection attacks before they reach LLM providers.

I have been building KoreShield for the better part of the past six months. It is a security proxy that sits between your application and any LLM provider, such as OpenAI, Anthropic, Gemini, DeepSeek, or Azure OpenAI, and intercepts every prompt before it reaches the model. The idea sounds simple. The execution was not.

This post is about what I actually learned: the architectural decisions that held up, the ones that did not, and the things about LLM security that I did not fully appreciate until I was deep inside the problem.

Why This Exists

Every team shipping an LLM product is, knowingly or not, running an unsecured endpoint into their application. When a user sends a message to your AI feature, that message travels through your system, gets appended to a system prompt, and arrives at the model as a complete conversation. At no point does the LLM provider inspect that message for malicious intent. Why would they? They charge per token. They are not in the security business.

The attacks this creates are not theoretical. Prompt injection, where an attacker embeds instructions inside what looks like ordinary user input, has been used to exfiltrate system prompts, bypass content filters, make models take unintended tool actions, and leak user data from RAG pipelines. These are live production incidents, not conference demos. Indirect prompt injection, where the payload arrives through retrieved documents or tool outputs instead of the user's own message, makes the problem even harder to contain.

What I wanted to build was a proxy that could detect these attacks in real time, block them before they hit the model, and give teams a clear audit trail of what was attempted. The core loop: intercept, analyse, allow or block, log.

The Detection Architecture I Did Not Expect to Build

My first instinct was to build a classifier. Train a model on known attack patterns, run it against incoming prompts, threshold at some confidence score. This is the obvious approach and it is also wrong for production.

The problem is latency. An ML inference call in the hot path of every prompt adds 80–200ms per request under load. For a product where users expect sub-second responses, that is not acceptable. You have already added round-trip time to the LLM. You cannot also add a second inference call.

The architecture I landed on layers three things in order of cost:

Fast pattern matching first. A compiled set of regular expressions covering known attack signatures like instruction override phrases, system prompt spoofing markers, credential enumeration patterns, encoded exfiltration attempts. These run in microseconds. If something matches here, it is blocked immediately with no further processing.

```python
HIGH_RISK_PATTERNS = [
    ("instruction_override",
     re.compile(r"ignore\s+(?:all\s+)?(?:previous|prior|above|earlier)\s+"
                r"(?:instructions|rules|prompts|guidelines|context)",
                re.IGNORECASE),
     "high", 0.35),
    ("credential_enumeration",
     re.compile(r"list\s+(?:all\s+)?(?:available\s+)?(?:api\s*keys?|tokens?|"
                r"passwords?|credentials?|secrets?)",
                re.IGNORECASE),
     "critical", 0.45),
    ...
]
```

A rule engine second. The rule engine applies structured checks: about 20 named rules (KRS-001 through KRS-020), each targeting a specific attack class such as SSRF via prompt, model DoS attempts, PII exfiltration, tool abuse chains, or jailbreak framing. Each rule has its own weight, and the total score across matched rules determines the risk level.
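A minimal sketch of what such a weighted rule engine might look like. The rule IDs, patterns, and weights here are illustrative stand-ins, not KoreShield's actual rule set:

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    rule_id: str          # e.g. "KRS-001"
    description: str
    pattern: re.Pattern   # illustrative: a real rule may combine several checks
    weight: float

# Hypothetical rules, for illustration only
RULES = [
    Rule("KRS-001", "instruction override",
         re.compile(r"ignore\s+(?:previous|prior)\s+instructions", re.I), 0.35),
    Rule("KRS-002", "role hijack",
         re.compile(r"you\s+are\s+now\s+(?:a|an)\s+", re.I), 0.25),
    Rule("KRS-003", "credential enumeration",
         re.compile(r"list\s+all\s+(?:api\s*keys|secrets)", re.I), 0.45),
]

def score_prompt(text: str) -> tuple[float, list[str]]:
    """Sum the weights of all matched rules; return the score and matched IDs."""
    matched = [r for r in RULES if r.pattern.search(text)]
    return sum(r.weight for r in matched), [r.rule_id for r in matched]
```

With an assumed block threshold of 0.6, a single match stays under the line while correlated matches push past it, which is the multi-signal behaviour described later in this post.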

Semantic scoring last, and only when needed. If the rule engine produces a score in an ambiguous range, too high to clear outright but not high enough to block, the semantic scorer runs. This is the expensive step: it embeds the prompt and checks similarity against known attack vectors. It only executes for the small percentage of prompts that clear the fast path but do not clear the rule engine.
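The three-tier decision can be sketched as follows. The thresholds, the similarity cutoff, and the plain cosine-similarity comparison are assumptions for illustration; a real system would use an embedding model to produce the vectors:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_score(prompt_vec: list[float],
                   attack_vecs: list[list[float]]) -> float:
    # Max similarity against the known-attack corpus
    return max((cosine(prompt_vec, v) for v in attack_vecs), default=0.0)

def decide(rule_score: float, prompt_vec: list[float],
           attack_vecs: list[list[float]],
           pass_below: float = 0.3, block_above: float = 0.6) -> str:
    if rule_score < pass_below:
        return "allow"        # fast path: clearly benign, no embedding cost
    if rule_score >= block_above:
        return "block"        # fast path: clearly malicious
    # Ambiguous band only: pay for the expensive semantic check
    sim = semantic_score(prompt_vec, attack_vecs)
    return "block" if sim >= 0.85 else "allow"
```

The key property is that the embedding work happens only inside the ambiguous band, which keeps the expensive step off the hot path for the vast majority of traffic.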

The result is that the majority of legitimate traffic is cleared in under 5ms. The majority of obvious attacks are blocked in under 10ms. Only ambiguous cases pay the full cost of semantic analysis.

The Thing That Broke First

Text normalisation. Crazy, right?

Attack prompts are rarely typed in clean ASCII. They arrive base64-encoded. They arrive with Unicode lookalike characters substituting for ASCII letters. They arrive with zero-width spaces injected between characters in "ignore previous". They arrive with RTL override characters that make "emit instructions" look like "snoitcurtsni time" to a naive scanner.

The first version of KoreShield matched zero of these because it matched raw input directly. The rule engine worked on exactly what the user sent, which an attacker can trivially manipulate.

The fix was a normalisation pipeline that runs before detection: decode base64 fragments, strip Unicode normalisation attacks, remove invisible characters, map lookalike characters back to their ASCII equivalents, collapse repeated whitespace. Only after normalisation does the input reach the pattern matchers.

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    # Strip zero-width and invisible Unicode characters
    text = re.sub(r'[\u200b-\u200f\u2028-\u202f\u2060-\u206f\ufeff]', '', text)
    # Normalise Unicode to NFC (handles lookalike attack chars)
    text = unicodedata.normalize('NFC', text)
    # Attempt base64 fragment decode (helper defined elsewhere)
    text = _decode_b64_fragments(text)
    # Collapse excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
```

This is now one of the most important components in the system and it was completely missing from my initial design.

Multi-Provider Proxying Is Harder Than It Looks

KoreShield supports OpenAI, Gemini, DeepSeek, and Azure OpenAI. Each has a different API shape, different authentication model, different response format, and different error semantics.

The naive approach is to write a normalisation layer that maps everything to OpenAI's API shape (since it is the de facto standard), proxy through, and un-normalise on the way back. This works until it does not. Gemini has a different content structure for multi-turn conversations. Azure OpenAI has per-deployment endpoints. DeepSeek has a different token budget mechanism. Each provider introduces edge cases that the normalisation layer has to handle explicitly.

What I ended up with was a provider adapter pattern: each provider gets its own adapter class implementing a common interface: format_request, parse_response, handle_error. The proxy core never talks to a provider directly. It only talks to adapters.

```python
from abc import ABC, abstractmethod

class ProviderAdapter(ABC):
    @abstractmethod
    async def format_request(self, messages: list, params: dict) -> dict: ...

    @abstractmethod
    async def parse_response(self, raw: dict) -> ProxyResponse: ...

    @abstractmethod
    def handle_error(self, status_code: int, body: dict) -> ProxyError: ...
```

The adapter pattern made adding new providers straightforward and kept the proxy core stable. It also made it easy to add provider-specific health monitoring, since each adapter owns its own health check logic.
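As an illustration of what one adapter's request mapping might look like, here is a hypothetical Gemini adapter translating an OpenAI-style message list into Gemini's request shape. The field names follow Gemini's public generateContent API; the class itself is a simplified sketch (synchronous, no error handling), not KoreShield's actual code:

```python
class GeminiAdapter:
    """Hypothetical sketch of the format_request half of an adapter."""

    ROLE_MAP = {"user": "user", "assistant": "model"}

    def format_request(self, messages: list, params: dict) -> dict:
        # Gemini carries the system prompt outside the turn list
        system_parts = [m["content"] for m in messages if m["role"] == "system"]
        contents = [
            {"role": self.ROLE_MAP[m["role"]], "parts": [{"text": m["content"]}]}
            for m in messages if m["role"] != "system"
        ]
        body = {"contents": contents}
        if system_parts:
            body["system_instruction"] = {
                "parts": [{"text": "\n".join(system_parts)}]}
        if "max_tokens" in params:
            body["generationConfig"] = {"maxOutputTokens": params["max_tokens"]}
        return body
```

Every provider-specific quirk, like the separate system_instruction field here, stays inside its adapter, which is exactly what keeps the proxy core stable.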

What Nobody Tells You About Prompt Injection Detection

The hardest part is not detecting attacks. It is reducing false positives without creating gaps.

Security developers will test their own systems. They will send "ignore previous instructions" to see if the system catches it. That is a true positive. But a user asking "what happens if I tell a chatbot to ignore previous instructions?" is a legitimate query about AI safety. A developer testing their integration will send all kinds of things that look malicious.

I spent a lot of time on rule sensitivity. The current system uses a weighted score rather than a binary match/no-match decision. A single pattern match raises the risk score but does not block by default. Multiple correlated matches, say, an instruction override phrase combined with a credential enumeration pattern and a role hijack attempt, can push the score past the block threshold. The idea is that real attacks are usually multi-signal, and legitimate prompts rarely trigger more than one rule at the same time.

The threshold is also configurable per deployment. A customer who wants maximum security accepts more false positives. A customer who needs high availability tunes toward fewer blocks. This is not a design compromise, it is the right model, because security posture is always a tradeoff and the team deploying the proxy knows their users better than I do.
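A per-deployment configuration for this tradeoff can be as small as a handful of knobs. The field names and values below are hypothetical, purely to illustrate the strict-versus-available spectrum:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DetectionConfig:
    # Hypothetical knobs; a real deployment would load these per tenant
    block_threshold: float = 0.6   # lower = stricter, more false positives
    pass_threshold: float = 0.3    # below this, skip semantic analysis
    fail_open: bool = False        # on detection error: allow (True) or block

# Two illustrative postures at opposite ends of the tradeoff
STRICT = DetectionConfig(block_threshold=0.45, pass_threshold=0.2)
AVAILABLE = DetectionConfig(block_threshold=0.75, fail_open=True)
```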

What Surprised Me About Running This in Production

Rate limiting is not just about protecting the model. I thought of rate limiting as a mechanism to prevent abuse and control costs. It is also your first line of defence against model DoS attacks, like prompts designed to trigger maximum token generation, exhaust context windows, or cause recursive tool calls. A rate limiter keyed on user identity, combined with token budget tracking, catches these before they reach the detection layer.
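A sliding-window limiter that tracks both request count and token spend per user might look like the following sketch. The class name, limits, and window size are assumptions, not KoreShield's actual implementation:

```python
import time
from collections import defaultdict, deque

class UserBudgetLimiter:
    """Hypothetical per-user sliding window over requests and token spend."""

    def __init__(self, max_requests: int, max_tokens: int, window_s: float = 60.0):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window_s = window_s
        self._events = defaultdict(deque)   # user_id -> deque of (ts, tokens)

    def allow(self, user_id: str, tokens: int, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self._events[user_id]
        # Evict events that have fallen out of the sliding window
        while q and now - q[0][0] > self.window_s:
            q.popleft()
        if len(q) >= self.max_requests:
            return False                    # request rate exceeded
        if sum(t for _, t in q) + tokens > self.max_tokens:
            return False                    # would blow the token budget
        q.append((now, tokens))
        return True
```

The token dimension is what catches model-DoS prompts: a user sending a few huge generation requests trips the budget long before the plain request counter would notice anything.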

The audit log is a product in itself. I built logging because every security product needs an audit trail. What I did not expect was that customers find the log more valuable than the blocking. Being able to see exactly what was attempted against your LLM, when, from which user, and what the system's reasoning was: that is useful for product teams, security teams, and compliance teams independently. The log turned into a dashboard, and the dashboard is now a core part of the value proposition.
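To make that concrete, one audit entry needs to capture the who, when, what, and why of each decision. A possible record shape, with hypothetical field names, could be:

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    """Hypothetical shape for one audit-log entry; fields are illustrative."""
    user_id: str
    decision: str                 # "allow" or "block"
    risk_score: float
    matched_rules: list = field(default_factory=list)
    reason: str = ""
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        # One JSON line per event, ready for log shipping or a dashboard
        return json.dumps(asdict(self))
```

Keeping the matched rule IDs and the reason in the record is what makes the log explainable to a compliance reviewer, not just a raw block counter.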

Your security layer needs to be invisible. The whole product lives and dies by the proxy being completely transparent to legitimate traffic. If it adds latency, if it breaks streaming, if it changes the response format even slightly then it fails as a product regardless of how good the security is. I have spent more engineering time on being invisible than on being secure, and I think that is correct.

Where Things Stand

KoreShield is running in production. The detection engine is live, the multi-provider proxy handles real traffic, and the dashboard gives teams visibility into threat patterns across their LLM deployments.

There is a lot left to build. Semantic analysis coverage needs to grow. The rule set needs to keep pace with new attack patterns since the field moves fast. Enterprise features around team management and compliance reporting are on the roadmap.

But the core architecture is solid, and the things I built wrong and had to rebuild taught me more than the things I built right the first time.

If you are building anything with LLMs in production and you are not thinking about the security surface, you should be. The attack patterns are real, they are being used today, and the models themselves have no awareness that they are being manipulated.

That is the gap KoreShield exists to close.


Tech stack: Python, C++, TypeScript
Live site: koreshield.com
Status: in production, actively maintained