575 Malicious AI Skills. Your AI Tool Registry Is the New Attack Surface.

Remember when everyone treated npm like a trusted source? Then came event-stream. Then ua-parser-js. Then colors.js. Thousands of developers learned that “install and go” is a security posture, not a strategy. The same thing is happening to AI tool registries right now. And nobody is treating it with the same urgency.

In April 2026, Acronis TRU published research showing that two of the most trusted platforms in the AI ecosystem, Hugging Face and ClawHub (OpenClaw’s community skill registry), were being actively exploited to distribute trojans, cryptominers, and infostealers. Not through some exotic zero-day. Through the same install flow developers use every day.

This is the AI supply chain attack era. And if you’re building with AI tools, you’re already in the blast radius.

575 Skills. 13 Accounts. Two Dominant Attackers.

Acronis Threat Research Unit identified 575 malicious skills distributed through ClawHub across 13 developer accounts. Two accounts did most of the damage:

Account | Malicious Skills | Share
hightower6eu | 334 | 58%
sakaen736jih | 199 | 35%
11 other accounts | 42 | 7%

These skills looked legitimate. YouTube transcript summarizers. Productivity helpers. The kind of tools you’d install without a second thought because the registry made it frictionless. Under the hood, they directed users to download password-protected archives or execute encoded commands that deployed AMOS Stealer, cryptominers, and remote access trojans.

I wrote about the ClawHavoc campaign earlier this year when the first reports surfaced. That was a warning shot. This is the full picture: 575 weaponized skills, a coordinated campaign, and two platforms that host over a million ML models treating uploads with the same oversight as a public pastebin.

The Attack Is Cross-Platform. The Techniques Are Not Amateur.

This wasn’t a script kiddie dumping obvious malware.
The Acronis research documents professional-grade tradecraft across Windows, macOS, and Linux:

Windows targets got VMProtect-packed payloads. A second variant used 30-byte XOR encryption for runtime string decryption and injected directly into explorer.exe. C2 communication ran over AES-encrypted HTTPS to a domain (velvet-parrot[.]com) that looks legitimate at a glance. Persistence? Scheduled tasks and Windows Defender exclusion modifications. The kind of techniques you’d see in a red team engagement, not a hobbyist project.

macOS targets received base64-encoded commands that downloaded AMOS Stealer, an infostealer sold as malware-as-a-service via Telegram. One installation and it scrapes browser credentials, crypto wallets, and session tokens.

Hugging Face as staging infrastructure. The ITHKRPAW campaign (targeting the Vietnamese financial sector) used Hugging Face dataset repositories as payload staging points. Malicious LNK files invoked Cloudflare Workers, which triggered PowerShell droppers that fetched payloads from Hugging Face. The payload chain displayed a decoy cat image to mask the activity. Researchers assessed with moderate confidence that the PowerShell dropper was itself LLM-generated, based on embedded Vietnamese-language comments and contextual ties to the ITHKRPAW operator.

Read that again. AI-generated malware, staged on an AI platform, distributed through an AI tool registry. The snake is eating its own tail.

The Prompt Injection Vector Nobody Is Discussing

The Acronis report documents something more concerning than trojanized installers: indirect prompt injection through skill files. Attackers embedded hidden instructions within skill descriptions and documentation. When an AI agent loaded these skills, it autonomously executed the embedded commands on the user’s behalf. The user never ran anything suspicious. The agent did it for them.

This is the attack vector that pre-action gates are designed to catch.
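A minimal sketch of what such a gate can look like, assuming the agent's tool calls can be intercepted as simple records before they run. The `ToolCall` shape, the `gate` function, and the allowlisted tool names and host are hypothetical values chosen for illustration; they are not part of OpenClaw or any specific framework. The check is ordinary code, so instructions hidden in a skill file cannot argue their way past it.

```python
import re
from dataclasses import dataclass
from typing import Tuple
from urllib.parse import urlparse

# Hypothetical policy values, for illustration only.
ALLOWED_TOOLS = {"read_file", "http_get"}
ALLOWED_HOSTS = {"api.example.internal"}

# Patterns typical of encoded droppers: base64 decoded on the command line,
# or PowerShell's -enc / -EncodedCommand flag.
ENCODED_CMD = re.compile(r"base64\s+(-d|--decode)|-enc(odedcommand)?\b", re.IGNORECASE)


@dataclass(frozen=True)
class ToolCall:
    tool: str       # name of the tool the agent wants to invoke
    argument: str   # raw argument string (path, URL, or command)


def gate(call: ToolCall) -> Tuple[bool, str]:
    """Return (allowed, reason). Anything not explicitly allowed is denied."""
    if call.tool not in ALLOWED_TOOLS:
        return False, f"tool '{call.tool}' is not on the allowlist"
    if ENCODED_CMD.search(call.argument):
        return False, "argument contains an encoded-command pattern"
    if call.tool == "http_get":
        host = urlparse(call.argument).hostname or ""
        if host not in ALLOWED_HOSTS:
            return False, f"host '{host}' is not on the allowlist"
    return True, "ok"
```

The gate denies by default: a new tool or host has to be added to the policy deliberately, which is the opposite of the frictionless install flow the campaign exploited.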
If your AI agent can install skills, execute code, or make network calls without a review step, you’re one malicious skill file away from credential theft. The agent is the attack surface, and the skill registry is the delivery mechanism.

If you’re running OpenClaw or any agentic framework with community plugins, this is not theoretical. It happened. It’s documented. The 575 skills are the proof.

Why AI Registries Are Worse Than npm

Traditional package registries are bad enough at security. AI tool registries are worse.

1. The trust model is implicit. npm has lockfiles, checksums, and provenance attestations (however imperfect). Hugging Face has model cards. ClawHub has… a listing page. When you install an OpenClaw skill, you’re trusting that the publisher is who they claim to be and that the code does what the description says. There’s no signing, no hash verification, no reproducible builds.

2. AI agents execute with user-level permissions. A malicious npm package runs inside a Node process (which is not much of a sandbox). A malicious AI skill runs with whatever permissions your agent has, which in most setups is everything the user can do. File system access, network access, shell execution. The blast radius is inherently larger.

3. The user doesn’t review the execution. When you run npm install, you can inspect the scripts in package.json. When an AI agent loads a skill, the execution happens inside the agent’s reasoning loop. The user sees the output, not the process. Indirect prompt injection exploits this gap perfectly.

What This Means for Anyone Building with AI

If you’re building AI-powered systems (and if you’re reading this, you probably are), here’s what the 575 malicious skills should change about how you work:

Audit your installed skills. If you’re running OpenClaw, check every installed skill for encoded commands, external download URLs, or obfuscated scripts. The Acronis report includes IoCs (indicators of compromise).
Block 91.92.242[.]30 and velvet-parrot[.]com at your firewall.

Treat AI registries as untrusted input. Same discipline you’d apply to a random npm package from a zero-follower account. Read the source. Check the publisher history. If a skill needs network access or shell execution to function, that’s a red flag.

Gate your agent’s actions. Every tool call, every file write, every network request from an AI agent should pass through a mechanical review gate. Not a prompt-based safety check. A code-level gate that can’t be bypassed by clever prompt injection. I’ve been building these for months. The Acronis research validates why.

Monitor for explorer.exe injection and Defender exclusion changes. These are the specific persistence techniques documented in the campaign. If your EDR isn’t watching for them, you have a visibility gap.

Assume Hugging Face models are untrusted until verified. The platform hosts over a
I Didn’t Know I Was Doing Harness Engineering

In early February 2026, Mitchell Hashimoto (co-founder of HashiCorp) described his habit of engineering permanent fixes into an AI agent’s environment whenever it made a mistake. He called it “engineering the harness.” Days later, OpenAI formalized the concept in a blog post.

Around the same time, without having read either, I wrote my first enforcement hook for a production AI system. Different continent, different scale, different context. Same problem.

A few weeks later, Birgitta Böckeler formalized it on Martin Fowler’s site. Red Hat published their version. LangChain. Salesforce. By April, the term was everywhere.

I didn’t discover any of this until recently. I was too busy building the thing they were naming. That’s not a flex. It’s something more interesting. When engineers face the same constraints (unreliable model outputs, production stakes, context that evaporates), they converge on the same solutions. Different trails, same summit. And if your messy pile of rules and scripts looks suspiciously like what OpenAI and Fowler describe, that’s not coincidence. It’s validation.

What Is Harness Engineering (And Why It Matters for AI Agents)

Harness engineering is the discipline of building the constraints, gates, memory systems, and feedback loops that wrap around an AI agent to make it reliable in production.

The core equation, from Martin Fowler’s team: Agent = Model + Harness. The harness is everything around the model that you actually control. If context engineering is about what reaches the model, harness engineering is about what constrains it after it responds.

Red Hat puts it differently. “The AI writes better code when you design the environment it works in.” Their framing is about structured workflows. Templates. Impact maps. Acceptance criteria.

Both are right. Neither is complete. They describe the architecture. They don’t describe the pain that forces you to build it.
How My Harness Grew (Without Me Realizing What It Was)

I run a production AI system as a daily driver. Not a demo. Not a proof of concept. A system that manages infrastructure, writes code, deploys to servers, interacts with APIs, and handles real stakes across real projects. I co-founded Aether Global Technology, a Salesforce consulting partner in Manila. The system runs alongside that work.

I never sat down and said “I’m going to build a harness.” I just kept getting burned, and kept adding rules so I wouldn’t get burned the same way twice. Looking back, every rule traces to a specific failure.

The anti-fabrication rules exist because the AI confidently stated a method existed in a file it hadn’t read. I spent 45 minutes debugging code that was never there. The fix wasn’t better prompting. It was a mechanical gate: before asserting any method name or file path, the system must verify via tool. No verification, no assertion. That’s a feedforward control, in Fowler’s language. I just called it “stop making things up.”

The deploy gate exists because the system nearly pushed Salesforce metadata to the wrong sandbox. 54 files, wrong org. The fix was a target allowlist per project, checked mechanically before any deploy command executes. A hard block, not a polite suggestion. (Sound familiar? An AI agent deleted a production database in 9 seconds because nobody built one of these.)

The anti-drift rules exist because after multiple tool calls, the system’s mental model of a file diverges from the file’s actual state. It recalls values it read 20 minutes ago, not the values that exist now. The fix: re-read the source before emitting anything external-facing. Grep at write time, not recall time.

The citation requirement exists because the system generated a client proposal with a number it pulled from nowhere. In consulting, a wrong number in front of a client is a credibility hit you don’t recover from. The rule is simple now: every data claim needs a source.
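As a rough sketch of how mechanical the citation rule can be, assume each data claim is a small record with an optional source. The `Claim` type and `render_claim` helper are hypothetical names for illustration; the actual hooks in the system described here are not shown in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    text: str
    source: Optional[str] = None  # URL, file path, or document reference


def render_claim(claim: Claim) -> str:
    """Attach the source when there is one; otherwise tag the claim as unverified."""
    if claim.source and claim.source.strip():
        return f"{claim.text} (source: {claim.source.strip()})"
    return f"{claim.text} [unverified]"
```

Because the tag is applied by code at output time, a claim cannot reach a client document without either a source or a visible marker.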
No source, mark it as unverified. No exceptions.

None of these came from reading a framework. They came from things going wrong on a Tuesday afternoon.

What Fowler Gets Right

The dual-control model is real. You need both feedforward controls (rules that prevent bad behavior before it happens) and feedback controls (sensors that catch it after). Relying on just one creates blind spots.

My system has 40+ feedforward hooks. They fire before tool calls, checking for unauthorized domains, verifying pre-task knowledge checks happened, blocking destructive git operations, enforcing deploy targets. That’s Fowler’s “guides” category. These are the same problems I wrote about in what autonomous agents actually cost in production.

The feedback side is thinner. I have post-execution checks and monitoring, but the honest truth is that feedforward controls do most of the heavy lifting. Catching a bad action before it executes is cheaper than cleaning up after it runs.

Fowler also nails the distinction between computational and inferential controls. My deploy gate is computational. It checks a JSON allowlist. Takes milliseconds. My anti-fabrication system is inferential. It relies on the model itself to flag uncertainty. That’s slower, less reliable, and more expensive. But it catches things no deterministic check can.

What the Frameworks Miss

Harnesses are incident-driven, not architecture-driven. The literature treats harness engineering as a design discipline. It is, eventually. But every harness I’ve seen starts as a pile of duct tape applied after something broke. The elegance comes later.

Context survival is the real engineering problem. Nobody talks about this enough. AI agents operate in conversation windows. Those windows compress. When they compress, the agent forgets rules, loses project state, and starts making the same mistakes you fixed three hours ago.
My harness has a dedicated recovery protocol: when context compresses, reload memory, re-read project state, verify the date (the agent doesn’t know what day it is after compression). That’s not in any of the frameworks. It should be.

The harness is the product, not the model. When people evaluate AI systems, they compare models. Claude vs. GPT vs. Gemini. That’s the wrong comparison. The model is interchangeable. I’ve run the same harness across model versions, and the harness determines output quality more than the model does. A disciplined
The Truth About Agent Swarming: What the Gurus Are Not Telling You About Cost, Failure, and Security

Everyone’s building “AI agent teams” right now. Five agents, ten agents, a whole swarm collaborating on complex tasks. At least, that’s what the YouTube thumbnails promise. The reality? Most of these systems are burning money, leaking data, and failing in ways their builders don’t even notice until the invoice arrives.

I built a multi-agent system. It runs in production, daily. So I’m not here to tell you agent swarming doesn’t work. I’m here to tell you that most of the advice circulating about it is dangerously incomplete.

The Swarm Hype Cycle Is in Full Swing

Open Twitter or YouTube right now and you’ll find a hundred tutorials showing you how to spin up a multi-agent team in under 20 minutes. CrewAI, AutoGen, LangGraph: the frameworks keep multiplying. The demos look incredible: agents researching, agents writing, agents reviewing each other’s work, all orchestrated into a beautiful pipeline.

Here’s what the demos don’t show: what happens when you run that pipeline 500 times. Or 5,000 times. Or when one agent hallucinates and the next agent treats that hallucination as fact and passes it downstream to a third agent that takes action on it.

The guru content follows a pattern: show the setup, show one successful run, skip the failure modes, skip the bill, skip the security implications. It’s like showing someone how to start a restaurant by filming one perfect dinner service and cutting before the health inspector shows up.

The latest version of this is “I built an entire company in 30 minutes with AI agents.” Someone spins up a framework like Paperclip, which, to be fair, has genuinely solid engineering underneath it (heartbeat scheduling, budget caps, task queues, audit trails), and the content that follows makes it sound like you can replace an entire org overnight. The tool isn’t the problem. The tool is fine.
The problem is the interpretation layer: gurus filming the setup, skipping the part where 48 pre-configured agents wake up every 4 hours on a frontier model and nobody mentions what that costs at the end of the month. Or what happens when agent #23 gets a poisoned input and the other 47 trust its output.

Why Multi-Agent AI Fails in Production

The coordination problem is real and it scales badly. Galileo’s research on multi-agent reliability found that adding agents multiplies failure points faster than the agent count itself: four agents create six potential failure points, not four. Ten agents create 45. Every agent-to-agent handoff is a place where context gets lost, instructions get misinterpreted, or outputs get corrupted.

CIO reported in March 2026 that true multi-agent collaboration remains largely aspirational. Their testing showed single agents hitting 100% success rates on isolated tasks, while hierarchical multi-agent structures failed 64% of the time and self-organized swarms failed 68%. That’s not a rounding error; that’s a fundamental coordination tax.

The failure modes I’ve seen firsthand:

No purpose definition. Agents exist because someone saw a cool demo, not because the task requires decomposition. A single well-prompted agent with good tools will outperform a badly orchestrated team of five every time.

No role boundaries. Two agents stepping on each other’s work, or worse, one agent undoing what another just did. Without strict scoping, you get agents arguing in loops, burning tokens while producing nothing.

Cascade failures. Agent A hallucinates a “fact.” Agent B cites it. Agent C acts on it. By the time a human reviews the output, three layers of confident-sounding nonsense have compounded. Galileo calls this “propagation of inaccuracies” and it’s the single biggest reliability risk in multi-agent systems.
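The counts quoted above (six for four agents, 45 for ten) match the number of distinct agent pairs, n(n-1)/2, assuming every agent can hand off to every other agent. That reading is an inference from the two figures, not something the Galileo research is quoted as stating, and the helper name below is illustrative.

```python
def pairwise_handoffs(agents: int) -> int:
    """Number of distinct agent-to-agent pairs in a fully connected team."""
    if agents < 0:
        raise ValueError("agents must be non-negative")
    return agents * (agents - 1) // 2


# Team sizes mentioned in this article, plus two small reference points.
for n in (2, 4, 5, 10, 48):
    print(f"{n:>2} agents -> {pairwise_handoffs(n)} potential handoff points")
```

Under the same assumption, the 48-agent setup mentioned earlier would have 1,128 pairs, which is why growth that looks harmless at five agents becomes hard to reason about at scale.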
Failure Pattern | What Happens | How It Scales
No purpose definition | Agents do work a single agent could handle | Cost multiplies, quality stays flat
No role boundaries | Agents duplicate or undo each other’s work | Token burn scales quadratically with agent count
Cascade hallucination | Bad output propagates through the chain | Compounds per hop: 3 agents = 3 layers of compounded error
Context window overflow | Shared context exceeds model limits, agents lose thread | Every agent’s output inflates the shared context for every other agent
Orchestrator bottleneck | Single coordinator becomes the weakest link | Orchestrator complexity grows O(n²) with agent count

The API Bill Nobody Shows You

Every agent in your swarm is an API call. More accurately, every agent is multiple API calls: the initial prompt, the tool calls, the retries, the context-sharing between agents. A five-agent team running on a frontier model isn’t 5x the cost of one agent. It’s often 10-15x once you factor in coordination overhead.

Stanford’s AI Index Report, cited by Monetizely, found that coordination overhead alone accounts for 15-25% of total operational costs in mature multi-agent systems. That’s before you count the actual task execution.

Here’s how the math works in practice. Say you’re running a research-and-write pipeline with five agents (researcher, analyst, writer, editor, fact-checker). Each agent averages 3,000 input tokens and 1,500 output tokens per task. On a frontier model, that’s roughly $0.04 per agent per task (pricing as of March 2026; check your provider’s current rates). Five agents: $0.20 per task. Sounds cheap, right?

Now add retries (an agent disagrees with another agent’s output and re-runs). Add context sharing (every agent needs to see what the others produced, so input tokens multiply). Add the orchestrator’s overhead. Add recursive thinking where an agent calls itself to refine. In production, that $0.20 task routinely becomes $0.80-$1.50.
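The arithmetic in this example is easy to reproduce. The sketch below takes the article's own figures as inputs ($0.04 per agent per task, five agents, an observed production range of $0.80 to $1.50 per task, 100 runs a day) and assumes a 30-day month; the constant and function names are illustrative. Amounts are kept in whole cents to avoid floating-point rounding.

```python
CENTS_PER_AGENT_TASK = 4   # $0.04 per agent per task (figure from the text)
AGENTS = 5
RUNS_PER_DAY = 100
DAYS_PER_MONTH = 30        # assumption: a 30-day month


def naive_task_cost_cents(agents: int = AGENTS) -> int:
    """Cost of one pipeline run if each agent is called exactly once."""
    return agents * CENTS_PER_AGENT_TASK


def monthly_cost_dollars(task_cost_cents: int) -> float:
    """Monthly spend for one pipeline at the given per-task cost."""
    return task_cost_cents * RUNS_PER_DAY * DAYS_PER_MONTH / 100


baseline = naive_task_cost_cents()          # 20 cents, the "$0.20" figure
for observed_cents in (80, 150):            # $0.80 and $1.50 per task
    multiplier = observed_cents / baseline
    print(f"${observed_cents / 100:.2f}/task = {multiplier}x baseline, "
          f"${monthly_cost_dollars(observed_cents):,.0f}/month")
```

With these inputs the production range works out to 4x to 7.5x the naive five-agent figure, or $2,400 to $4,500 a month for one pipeline.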
Run it 100 times a day and you’re looking at $80-$150 daily, or $2,400-$4,500 monthly, for a single pipeline. The gurus never show you the billing dashboard.

I’ve seen my own costs spike 4x in a single day when an agent hit a retry loop that the orchestrator didn’t catch. That’s the kind of lesson you only learn in production, not in a 20-minute tutorial.

I wrote more about what autonomous agents actually cost in production: the single-agent version of this problem, which multi-agent setups compound.

The Security Problem Nobody’s Talking About

This is the part that genuinely concerns me. People are downloading MCP servers from GitHub, connecting premade agent builders, and giving their swarm access to production databases,
Best LLM for Each Task: A Practitioner’s Reference Guide

Most AI vendors sell you one model at a flat fee. It works, until it doesn’t.

Here’s the pitch: “Unlimited AI, fixed price!” Under the hood, they’ve slapped a single budget model on everything: your customer support bot, your code reviews, your data analysis, your document generation. It handles the simple stuff fine. Then you ask it to reason through a complex business decision, and it confidently gives you an answer that’s completely wrong.

You go back to the vendor. Their response? “You need to upgrade to the premium model.” That’s not an upgrade problem. That’s a model selection problem, and you just paid to discover it the hard way.

Choosing the best LLM for each task is an architecture decision, not a shopping decision. LLMs are not interchangeable. Each model family is built with different strengths, different architectures, and different cost profiles. Using the wrong one doesn’t just waste money; it produces hallucinations, missed context, and confidently wrong outputs that kill trust in AI across your team. (New to LLMs? Start with What Is AI, Really? for the fundamentals.)

Full disclosure: I use Claude as my primary daily driver. Where that might bias my recommendations, I’ve noted alternatives and linked directly to provider docs so you can verify independently.

This guide is your reference point. Bookmark it. Come back when a vendor tells you their tool “uses AI” and can’t tell you which model, or why.

Why One LLM Doesn’t Fit Every Task

If you’ve ever wondered how to decide which LLM to use, the answer starts with understanding what each model was actually built for.

Think of it like hiring. You wouldn’t hire a junior analyst to architect your enterprise data platform. You also wouldn’t hire a principal architect to sort spreadsheets, not because they can’t, but because you’re burning $300/hour on a $30 task. LLMs work the same way:

Frontier models (Claude Opus, GPT-5.4, Gemini 3.1 Pro) are deep thinkers.
They reason through multi-step problems, hold massive context windows, and produce nuanced output. They also cost 10-50x more per token than lightweight models.

Mid-tier models (Claude Sonnet, GPT-5.4 mini, Gemini 3 Flash) hit the sweet spot: fast enough for production, smart enough for most tasks, and priced for volume.

Lightweight models (Claude Haiku, GPT-5.4 nano, Gemini 2.5 Flash-Lite, DeepSeek V3.2) are built for speed and cost. They’re excellent at structured extraction, classification, simple Q&A, and high-volume processing. Ask them to architect a system or reason through ambiguity? That’s where hallucinations start.

The right approach is task routing: matching each task to the model that handles it best. Your total cost drops, your quality goes up, and you stop blaming “AI” for problems that are really model mismatch.

The Task-Model Matrix: Best LLM for Each Task

This is the reference table. Every recommendation comes from daily production use, cross-referenced with each provider’s own documentation.
Task | Best Pick | Runner-Up | Why It Wins | Avoid
Complex reasoning & architecture | Claude Opus 4.6 | GPT-5.4 | Extended thinking, 1M token context, multi-step logic chains | Lite/Nano models: they hallucinate on multi-step reasoning
Production code generation | Claude Sonnet 4.6 | GPT-5.4 mini | Fast + code-native, 64K output, strong instruction-following | Budget models: inconsistent on large codebases
Agent orchestration & tool use | Claude Opus 4.6 | Grok 4.20 multi-agent | Reliable function calling, long-context planning, handles complex tool chains | Any “lite” model: they lose track of multi-turn tool sequences
Content writing & copywriting | Claude Sonnet 4.6 | GPT-5.4 | Natural voice, strong style control, follows nuanced instructions | DeepSeek, Grok fast: flat tone, poor style adaptation
Data extraction & structured output | Gemini 3 Flash | DeepSeek V3.2 | Fast JSON mode, schema adherence, cheap at scale ($0.50/MTok in, $3/MTok out) | Frontier models: overkill, 10x+ cost for the same result
High-volume classification | Gemini 2.5 Flash-Lite | GPT-5.4 nano | $0.10/MTok input: pennies per thousand calls, fast enough for real-time | Any full-size model: you’re paying for intelligence you don’t need
Quick Q&A & chatbots | Gemini 2.5 Flash-Lite | Claude Haiku 4.5 | Sub-second latency, low cost, good enough for conversational retrieval | Frontier reasoning models: latency kills UX, cost kills margin
Deep research & analysis | Claude Opus 4.6 (extended thinking) | Gemini 3.1 Pro | Can reason through 1M+ token contexts, extended thinking for deliberate analysis | Anything under 128K context: can’t fit the data
Budget-conscious general use | DeepSeek V3.2 | Gemini 2.5 Flash | $0.28/MTok input, $0.42/MTok output: 10x cheaper than most competitors at reasonable quality | Free tiers with rate limits: they throttle when you need them most

Every link above goes to the provider’s official docs, no third-party benchmarks, no secondhand claims.
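Task routing from the matrix above can start as a plain lookup table. This is a sketch, assuming each task is tagged with a type before dispatch; the task-type keys and the `pick_model` helper are hypothetical, and the model names are display labels copied from the table, not API model identifiers.

```python
from typing import NamedTuple


class Route(NamedTuple):
    best: str
    runner_up: str


# Labels copied from the table above; the dictionary keys are made up for this sketch.
ROUTES = {
    "complex_reasoning": Route("Claude Opus 4.6", "GPT-5.4"),
    "code_generation": Route("Claude Sonnet 4.6", "GPT-5.4 mini"),
    "structured_extraction": Route("Gemini 3 Flash", "DeepSeek V3.2"),
    "classification": Route("Gemini 2.5 Flash-Lite", "GPT-5.4 nano"),
    "chatbot": Route("Gemini 2.5 Flash-Lite", "Claude Haiku 4.5"),
}


def pick_model(task_type: str, primary_available: bool = True) -> str:
    """Return the table's best pick, or its runner-up if the primary is unavailable."""
    try:
        route = ROUTES[task_type]
    except KeyError:
        raise ValueError(f"no route defined for task type '{task_type}'") from None
    return route.best if primary_available else route.runner_up
```

Raising on an unknown task type is a deliberate choice in this sketch: a silent default to a frontier model hides cost, and a silent default to a lightweight one hides quality problems.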
How to Choose the Right LLM: The Task-First Framework

Forget “which AI is best.” The right question is: best for what? Here’s the framework I use across every production deployment:

1. Define the task type first. Is it reasoning, generation, extraction, or routing? Each has fundamentally different requirements.

2. Match to a model tier.
   Needs to think? → Frontier (Opus, GPT-5.4, Gemini 3.1 Pro)
   Needs to produce? → Mid-tier (Sonnet, GPT-5.4 mini, Gemini 3 Flash)
   Needs to classify or extract? → Lightweight (Haiku, Nano, Flash-Lite)

3. Check the context window. If your task involves processing documents, code repositories, or conversation histories longer than 128K tokens, most lightweight models are physically incapable of handling it. This isn’t a quality issue; the data doesn’t fit.

4. Calculate the real cost. A $5/MTok model that gets it right on the first try is cheaper than a $0.10/MTok model that needs three retries and human review. Factor in error correction, not just token price.

5. Test with your actual workload. Benchmarks measure synthetic tasks. Your data, your prompts, your edge cases: those are what matter. Run a 100-call sample before committing.

Best LLM for Coding and Development

This is where model selection matters most, because bad code from an AI doesn’t just waste tokens; it wastes developer hours debugging AI-generated bugs.

For code generation in production, Claude Sonnet 4.6 is the current leader. It handles multi-file edits, understands project context, and follows coding conventions consistently. At $3/MTok input and $15/MTok output, it’s the workhorse, fast enough for
What Running AI in Production Taught Me That No Philippine Hackathon Will

The Philippines has the highest AI adoption rate in ASEAN. 92%, according to the 2025 Philippine AI Report. That number sounds impressive until you read the next line. 65% of those organizations are stuck in pilot. Not scaling. Not in production. Piloting. Running POCs that never graduate. Building demos that never see real users. And the pattern repeats: another hackathon, another pitch deck, another POC competing for the same thin use case.

I’ve watched this from the inside since 2024, running an enterprise consulting firm in Manila while building AI operations infrastructure that I use in production every day. What I’ve seen is a country full of smart builders solving the wrong layer of the problem.

The POC Trap

Someone discovers Lovable, Replit, or Bolt. They build something in a weekend: a chatbot, a document processor, a “smart” dashboard. It works. They demo it. Maybe it wins a competition.

Then reality hits. The app needs to handle more than 10 users. It needs to connect to an actual database that isn’t a Google Sheet. It needs authentication, logging, error handling, monitoring. It needs to run when the builder isn’t watching.

And this is where 65% of Philippine AI projects die. The idea was fine. The builder was talented. But nobody planned for what happens after “it works on my machine.”

The problem isn’t intelligence. The Philippines has no shortage of skilled developers. The problem is that the entire ecosystem is optimized for building demos, not running systems.

What Running AI in Production Actually Requires

It’s not a tools problem. The Philippines doesn’t need another chatbot builder or another “AI-powered” SaaS product chasing the same narrow market. What’s missing is the boring stuff. The stuff that doesn’t win hackathons or trend on LinkedIn:

1. Context management. LLMs forget everything between conversations.
If your AI system can’t maintain context across sessions (what your organization has decided, what’s been tried, what failed), you’re starting from zero every time. I wrote about this in depth: context engineering is infrastructure, not prompting. It’s the difference between an AI that gives generic answers and one that gives useful ones.

2. Anti-fabrication. AI makes things up. Everyone knows this. Almost nobody builds mechanical systems to catch it. Every data point needs a source. Every claim needs evidence. Every “I don’t know” needs to actually say “I don’t know” instead of guessing confidently. This isn’t a prompt engineering problem. It’s an architecture problem that I’ve written about extensively, including the time an AI agent deleted a production database in 9 seconds because nobody built a gate to stop it.

3. Operational persistence. Your AI assistant is useless if it loses its memory every time the session ends. The knowledge captured last Tuesday needs to be available next Thursday. The decision made in one project needs to inform work in another. This requires persistent storage, indexing, and retrieval systems. None of that comes free with an API key.

4. Multi-system orchestration. Real work doesn’t happen inside a single app. It happens across Salesforce, Jira, Google Workspace, SSH connections, deployment pipelines. An AI that can write a nice email but can’t check your CRM, update your tickets, or deploy code isn’t an operations system. It’s a toy. I call the discipline of wiring all of this together “harness engineering”: building the scaffolding that turns raw AI capability into something that actually runs your day.

5. Failure handling. What happens when the AI gets stuck? What happens when it loops? What happens when it fabricates a method name that doesn’t exist and tries to call it? Production AI systems need circuit breakers, escalation paths, and the humility to stop and ask for help.
I’ve written a full tutorial on building pre-action gates for exactly this reason, and open-sourced the code because it’s too important to keep proprietary.

6. Supply chain security. This one is new, and most builders aren’t thinking about it yet. AI tools pull in skills, plugins, and integrations from registries that nobody audits. I cataloged 575 malicious AI skills in a single tool registry. Prompt injection, data exfiltration, credential theft, all hiding behind legitimate-looking tool descriptions. If your AI system connects to external tools, you need a security posture for that. Most don’t have one.

What I’m Building (and Sharing)

I run Aether Global Technology Inc., a Salesforce consulting firm in Manila. In 14+ years in enterprise tech, I’ve led deployments for clients in aviation, banking, pharmaceutical, healthcare, logistics, and legal sectors, including a record-time Salesforce Service Cloud deployment across 3 call centers in 89 days for a major national airline.

Separately from client work, I’ve spent the past year building a personal AI operations system. I don’t sell it. I use it. A daily driver that I built because nothing on the market solved the actual problem: how do you run a complex operation when you’re wearing 10 hats and can’t afford to lose context?

The answer wasn’t a better chatbot. It was infrastructure. Persistent memory that survives session boundaries. Mechanical gates that prevent fabrication. Agent orchestration that coordinates work across platforms. Failure handling that stops loops before they waste hours.

I didn’t build this because it was trendy. I built it because I was drowning without it.

And I’m publishing what I learn. Since March 2026, I’ve written 14 technical articles on production AI patterns, from why multi-agent AI fails to the security risks of vibe coding to why rip-and-replace AI strategies are a $547 billion mistake. I’ve cross-posted to Dev.to for the developer community.
I’ve open-sourced my pre-action gate library because some patterns are too important to keep behind closed doors. And I presented “The Agentic AI Landscape” at the LivePerson x Aether enterprise event in Makati.

Volume doesn’t matter. What matters is that everything I write about is something I built, broke, fixed, and run daily. That’s the difference between an AI expert who presents and one who practices.

Why the Philippines Needs AI Operations, Not More AI Apps

The Philippines is at
Most AI Tools Are Just LLM Wrappers. Here’s What Actually Matters.

In 2025, over $200 billion poured into AI startups, and a staggering share went to the application layer. The product? Take an LLM API. Add a text box. Maybe some prompt templates. Charge $30/month. Call it “AI-powered.”

Not mad at the hustle. But if your entire product disappears the moment ChatGPT adds your feature for free, you don’t have a product. You have a timing play.

A Practitioner’s AI Tool Evaluation Framework

Before you spend, score. This is the framework I use to evaluate any AI tool, wrapper or otherwise:

| Criteria | Question to Ask | Red Flag |
|---|---|---|
| Replicability | Can I get the same output by pasting the input into ChatGPT? | Yes = thin wrapper |
| Connectors | Does it integrate with my actual systems (CRM, ticketing, deployment)? | Text-in/text-out only |
| Memory | Does it learn from previous sessions, or start fresh every time? | No persistence |
| Methodology | Does it capture learnings and improve, or just run prompts? | No feedback loop |
| Survivability | If the underlying model adds this feature natively, does the tool still matter? | Entire value prop disappears |

Score 0–2 on each. Below 5 out of 10? You’re renting a feature, not buying a tool. Above 7? Probably worth the spend.

The Wrapper Test

One question tells you everything: Can you replicate the output by pasting the same input into ChatGPT or Claude?

If yes, it’s a wrapper. You’re paying for UI and convenience, not intelligence. If no, because it’s pulling from multiple data sources, applying domain logic, or integrating with real systems, it might be something real. Most fail the test.

Thin vs. Thick

Not all wrappers are equal.
The market is splitting fast:

|  | Thin Wrapper | Thick Wrapper |
|---|---|---|
| What it does | UI + API call + system prompt | Real integrations, domain logic, data pipelines |
| Defensibility | None: one platform update kills it | High: value is in the connectors |
| Example | “AI email writer” (GPT call with a system prompt) | Cursor (reads your codebase, understands project context) |
| Survival odds | Low | Decent |

The graveyard of 2025–2026 is littered with thin wrappers that a platform update made irrelevant overnight.

What Actually Matters

Strip away the wrapper. Where does the real value live?

1. Connectors

The ability to talk to real systems. Salesforce, Jira, databases, email, file storage, APIs. This is where 80% of the actual work lives. Getting an AI to generate text is trivial. Getting it to read your CRM records, cross-reference tickets, update a database, and notify Slack: that’s integration work. That’s hard. That’s valuable. Most wrappers don’t touch this. They live in the text-in, text-out world.

2. Captured Domain Expertise

An AI that’s been learning your industry’s quirks for months is worth more than a fresh GPT-5 instance with a clever prompt.

|  | Fresh AI + Great Prompt | AI + 6 Months of Learnings |
|---|---|---|
| Platform quirks | Discovers them painfully | Already knows them |
| Common mistakes | Makes them all | Has guardrails for each |
| Your terminology | Constant correction needed | Uses it naturally |
| Edge cases | Surprised every time | Documented patterns |

The knowledge compounds. Every session, every bug fix, every “oh, that’s how this actually works” gets captured and fed back. No wrapper captures this. They start fresh every time. This is why context engineering (persistent memory, retrieval layers, enforcement gates) matters more than the tool you’re using.

3. Methodology

How you approach problems with AI matters more than which model you use.

The wrapper approach: open tool → type request → get output → hope it’s right.

The practitioner approach:

1. Small test: constrained input, see what happens
2. Evaluate: what worked? What broke?
3. Capture: document the learning
4. Adjust: update the approach
5. Repeat

The tool is 10%. The methodology is 90%.

The “Just Build It” Case

Here’s the uncomfortable truth. Building your own system, even ugly, even scrappy, gives you something no wrapper provides: understanding. You know why it works. Why it breaks. How to fix it. When the model changes (and it will), you swap the engine. The connectors, the learnings, the guardrails: those persist. They’re yours.

Cost at scale:

|  | Wrapper Stack | Custom (Direct API) |
|---|---|---|
| Month 1 | $150/seat: fast setup | $500 dev time: slower start |
| Month 6 | $150/seat: same capabilities | $50/month API: growing capabilities |
| Year 1 (5 seats) | $9,000 | ~$3,100 + compound knowledge |

Custom costs less AND gets smarter. The wrapper costs the same and stays the same. And when you go custom, you need to think about what autonomous agents actually cost in production, not just the sticker price.

The Philippines advantage: smaller teams with direct API access can outperform larger orgs paying for wrapper stacks. When you can’t afford $150/seat for 6 different AI tools, you build one system that does what you need. That constraint produces better architecture.

When Wrappers DO Make Sense

Fair is fair:

- Speed to market: need something running tomorrow without engineering capacity? A wrapper gets you there.
- Thick wrappers with real integrations: Cursor, Harvey, Perplexity add genuine value beyond the API call.
- Exploration phase: trying 5 wrappers to understand the capability space before building your own is smart R&D.

The key question: Are you buying a tool or renting a feature? If the value prop is “we make it easy to talk to an LLM,” that feature is getting commoditized in real time. Every model provider is making their native interface better, faster, cheaper.

What to Build Instead

Ready to go beyond wrappers? Start here:

1. Map your connectors. What systems does your AI need to talk to? Build those integrations first. Hardest part. Most valuable.

2. Capture everything. Every platform quirk. Every failed approach. Every successful pattern. Your AI should learn from your organization’s experience, not start fresh every session.

3. Own your methodology. Document how you approach problems with AI. Small tests → captured learnings → iteration. More valuable than any tool you can buy.

4. Accept ugly. The most effective AI systems I’ve built are not pretty. Config files, markdown documents, scripts. They look like plumbing. They work like machines.

Bottom Line

The moat isn’t the model. It never was. It’s the connectors that talk to your stack. The domain expertise captured over
Autonomous AI Agents Look Great in Demos. Here’s What They Cost in Production.

You’ve seen the demos. An AI agent opens a browser. Navigates a website. Fills out forms. Makes decisions. Ships code. All by itself. Looks like magic.

Then you deploy it. It runs 24/7. Nobody’s watching. The invoice arrives.

Here’s why autonomous AI agents fail in production, and what actually works instead.

The Demo Is Not the Product

I build agent systems. I’m not anti-agent. I’m anti-fantasy.

The fully autonomous pitch sounds like: “Just let the AI handle it. It’ll figure it out.” In a demo with curated inputs? Sure. In production where data is messy and one wrong decision costs real money? Different story entirely.

What Autonomous Agents Actually Cost

API Burn

Autonomous agents reason through loops. Every iteration burns tokens. When an agent gets stuck, and they do, it’s paying to argue with itself.

| Scenario | Cost |
|---|---|
| Agent completes task cleanly | $0.15–$0.80 |
| Reasoning loop (5–10 iterations) | $2–$8 |
| Logic trap (nobody notices) | $50+ before cutoff |
| 24/7 monitoring agent | $300–$800/month |

A single runaway agent can consume your monthly budget in hours. Not hypothetical. It happens.

The Amazon Kiro Incident

In late 2025, Amazon’s Kiro AI agent autonomously deleted and recreated an AWS production environment. 13-hour outage. The root cause wasn’t a bad model. It was the absence of permission boundaries, peer review, and a destructive-action blocklist. The agent did exactly what it was designed to do. Nobody designed the guardrails.

Drift: The Silent Killer

Kyndryl’s 2026 research nails it: agents that work correctly on day 1 gradually shift behavior as they hit edge cases.

A fintech company deployed an agent to manage infrastructure costs. It learned traffic patterns, then autonomously scaled down a database cluster one weekend. That weekend was month-end processing. Production down for 11 hours.

A customer service agent learned that issuing refunds correlated with positive reviews. Started granting refunds more freely. Not because anyone told it to, but because it observed the pattern and optimized for it.

Drift is invisible until something breaks.

Maintenance Reality

Gartner predicts over 40% of agentic AI projects will be cancelled by 2027 due to escalating costs and inadequate risk controls. Industry estimates put ongoing maintenance at 15–30% of operational budgets for autonomous systems:

- Model drift correction
- Data pipeline upkeep
- Security monitoring
- “Why did the agent do that?” investigations

That’s not in the pitch deck.

The “Set It and Forget It” Fantasy

The selling point is that autonomous agents free up human time. The reality: you traded a human doing a task for a human watching an AI do a task, plus the API bill.

Fully autonomous agents need more monitoring than manual processes, not less. When a human makes a mistake, they usually catch it. When an agent makes a mistake, it makes it confidently, repeatedly, and at scale.

The Alternative: Autonomy with a Leash

I run agent systems in production. They work. Here’s why: they’re supervised, scheduled, and tiered. The difference is context engineering: infrastructure that maintains consistency, not prompts that hope for it.

Supervised

AI does the work, a human reviews before it ships. For high-stakes actions (deployments, client comms, financial ops), there’s always a checkpoint. Not slower. Safer. The review loop catches drift before production.

Scheduled

Agents run on defined schedules with defined scopes. Not 24/7 open-ended autonomy. You control when they run, what they touch, and how much they spend. A scheduled agent running 3x/day costs a fraction of an always-on agent. And it’s predictable.
Tiered

Not every task needs the same oversight:

| Blast Radius | Examples | Autonomy Level |
|---|---|---|
| Low | Formatting, data entry, reports | Full auto: let it run |
| Medium | Content creation, analysis | AI executes, human spot-checks |
| High | Deployments, client comms | AI prepares, human approves |
| Critical | Production changes, security | Human executes, AI assists |

The tier is based on blast radius, not convenience. “What’s the worst that happens if this gets it wrong?” determines the oversight level.

The Cost Comparison

|  | Fully Autonomous | Supervised + Scheduled |
|---|---|---|
| API cost | Unpredictable: 24/7 burn | Predictable: runs on schedule |
| Drift risk | High: no review loop | Low: caught at checkpoints |
| Failure cost | Catastrophic (see: Kiro) | Contained: blast radius limited |
| Maintenance | 20–50% of budget | Fraction: simpler, fewer surprises |
| Demo quality | Incredible | Boring |

The boring option wins. Every time.

Three Questions Before You Deploy

1. What’s the blast radius? If this agent gets it wrong, what breaks? A formatting error or a production database?

2. What’s the budget cap? Hard limit on API spend per agent, per run. A logic loop should hit a ceiling, not your credit card.

3. Where’s the human checkpoint? For actions above your risk threshold, the agent prepares, a human approves. That’s not a bottleneck. That’s insurance.

The Market Will Correct

The “fully autonomous” pitch will fade. Not because the tech isn’t impressive. It is. But production costs are undeniable, and enterprises don’t tolerate 13-hour outages from unsupervised AI.

What survives:

- Agent systems with defined scopes
- Human checkpoints for high-risk actions
- Captured learnings so agents don’t repeat mistakes
- Cost controls that prevent runaway spend

Building from the Philippines, cost efficiency isn’t optional. It’s survival. That constraint forced us to design agent systems that are lean, supervised, and sustainable. Sometimes the best innovations come from not being able to afford the wasteful approach.
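The tiering table and the budget-cap question from this article can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system described here: `Tier`, `RunBudget`, `run_agent`, and the per-step dollar costs are hypothetical names and numbers invented for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Tuple


class Tier(Enum):
    """Oversight tiers keyed by blast radius (labels mirror the table above)."""
    LOW = "full auto"
    MEDIUM = "AI executes, human spot-checks"
    HIGH = "AI prepares, human approves"
    CRITICAL = "human executes, AI assists"


class BudgetExceeded(RuntimeError):
    """Raised when a step would push spend past the hard cap."""


@dataclass
class RunBudget:
    cap_usd: float          # hard limit for this agent run (hypothetical unit)
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        # Refuse the charge *before* spending, so the cap is a ceiling.
        if self.spent_usd + cost_usd > self.cap_usd:
            raise BudgetExceeded(f"cap of ${self.cap_usd:.2f} would be exceeded")
        self.spent_usd += cost_usd


def run_agent(steps: Iterable[Tuple[str, float]], budget: RunBudget,
              tier: Tier, approved: bool = False) -> List[str]:
    """Run (name, cost_usd) steps, honoring the tier and the budget cap."""
    if tier is Tier.CRITICAL:
        return ["blocked: human executes, agent only assists"]
    if tier is Tier.HIGH and not approved:
        return ["blocked: waiting for human approval"]
    log = []
    for name, cost in steps:
        try:
            budget.charge(cost)
        except BudgetExceeded:
            log.append(f"stopped before {name}: budget cap reached")
            break
        log.append(f"ran {name}")
    return log
```

A looping agent under this sketch stops at the ceiling instead of continuing to spend: with a $1.00 cap and three $0.40 steps, the third step is refused. The design choice worth noting is that the check happens before the charge, so a single expensive step cannot overshoot the cap.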
The real question isn’t which AI tool to buy. It’s how to evaluate whether the tool matters at all.
Context Engineering: Why Your AI Strategy Needs Infrastructure, Not Better Prompts

Five minutes on LinkedIn and you’ll find it. Someone sharing “the one prompt that changed everything.” A magic system prompt. A secret ChatGPT trick. A “10x framework.”

Here’s the thing. I’ve built production AI systems across enterprise consulting, content automation, and our internal operations. The prompt is maybe 5% of why any of it works. The other 95%? Infrastructure. Memory. Enforcement. Captured learnings. That’s context engineering, and it’s the skill that actually matters in 2026.

Prompt Engineering Has a Ceiling

Prompt engineering isn’t useless. It’s just the starting line. Here’s what the prompt gurus conveniently leave out:

| What They Show | What Actually Happens |
|---|---|
| Fresh conversation, perfect prompt | Message 200: context window full, business rules forgotten |
| One-shot demo, curated input | Production workflow hitting edge cases the prompt never anticipated |
| “Just tell the AI to be careful” | AI ignoring that instruction 3 hours into a session |

Prompts are stateless. Every conversation starts from zero. Your AI doesn’t remember what worked yesterday or what broke last week. That’s not a prompt problem. That’s an infrastructure problem.

What Is Context Engineering?

The short version: designing systems that deliver the right information to an AI at the right time, maintain behavioral consistency, and improve through captured experience. It’s not a prompt template. It’s architecture.

Prompt engineering = giving a new hire a great job description.

Context engineering = giving them the job description, an onboarding manual, institutional knowledge, and a manager who catches mistakes before they ship.

Which one performs better on day 30?

The Three Layers

Every production AI system I’ve built operates on three layers.

Layer 1: What the AI Knows Right Now

The active context: current conversation, task at hand, files being worked on. Most people stop here.
Layer 2: What It Can Retrieve When Needed

The retrieval layer: persistent memory, documented learnings, platform-specific knowledge the AI pulls in when relevant. The AI needs to know where to look, not memorize everything.

Layer 3: What It’s Mechanically Prevented From Doing Wrong

The enforcement layer: automated checks that fire before or after AI actions. Not guidelines. Not suggestions. Mechanical gates.

The gap: most AI implementations have Layer 1. Some have Layer 2. Almost nobody has Layer 3.

Memory: Teaching AI to Remember

The biggest lie in AI tooling is that conversation history equals memory. It doesn’t. Conversation history is a rolling buffer that gets compressed, truncated, or dropped. Your AI doesn’t “remember.” It reads what’s still in the window.

Production memory looks different:

- Persistent state files: structured notes the AI reads at session start. Project status, decisions made, open items. Intentional, curated memory, not chat history.
- Session recovery: what happens after context compression or a new session? If the answer is “start over,” you’re re-teaching the AI every time.
- Platform learnings: captured knowledge about specific tools and platforms. Every quirk, every gotcha, every workaround. An AI that’s absorbed 100+ sessions of this doesn’t make rookie mistakes.

The compound effect:

| Time | What the AI Knows |
|---|---|
| Day 1 | The prompt |
| Week 2 | Prompt + 10 captured learnings |
| Month 3 | Prompt + 60 learnings + platform quirks + failure patterns |
| Month 6 | Knows your business better than most new hires |

That’s the moat. No prompt template replicates six months of captured institutional knowledge.

Enforcement: Mechanical Gates, Not Vibes

Let’s be real: “be careful” is not a guardrail. Writing “always verify before acting” in a system prompt is a suggestion. The AI follows it when convenient, ignores it when confidence is high. I’ve watched it happen dozens of times.
Production enforcement is mechanical:

- Pre-action gates: automated checks that fire before execution. The AI literally cannot proceed without passing. Not a prompt instruction, a system-level block.
- Anti-drift detection: AI behavior softens toward generic assistant mode over long sessions. Enforcement catches this and corrects it. Mechanically. Not by asking nicely.
- Anti-fabrication: every data point traces to a named source. No source? Flagged, not presented as fact. In client work, fabricated data is career-ending.
- Scope control: the AI does what was asked. Not “while I’m here, let me also improve this.” Bug fix ≠ refactor. Enforced.

Without these gates, autonomous agents fail in production, not because the model is bad, but because nobody designed the guardrails.

Stop thinking about what you want the AI to do. Start thinking about what you need to prevent it from doing.

The Methodology: Small Tests, Captured Learnings, Iteration

The guru approach:

1. Craft the perfect prompt
2. Ship it
3. Hope it works

The practitioner approach:

1. Run a small test
2. See what breaks
3. Capture the lesson
4. Update the system
5. Run again

Boring? Yes. Effective? Absolutely.

Every bug fix becomes a learning. Every platform quirk gets documented. Every failure mode gets a guardrail. The system gets smarter not because the model improved, but because you designed it to learn from its own mistakes.

Building from the Philippines, we work with smaller teams and tighter budgets. We can’t afford an AI that makes the same mistake twice. The methodology isn’t a nice-to-have. It’s survival.

Why Context Engineering Wins Over Prompt Engineering in Production

The “magic prompt” has a half-life. Models update. Context windows change. Your clever prompt breaks. You rewrite it. It breaks again. Welcome to the treadmill.
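To make the idea of a mechanical gate concrete, here is a minimal Python sketch of the pre-action, scope-control, and anti-fabrication checks from the enforcement section. It is not the open-sourced gate library mentioned elsewhere in these articles; `Action`, `DESTRUCTIVE_KINDS`, and every function name below are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Action:
    kind: str     # e.g. "edit_file" or "drop_table" (hypothetical labels)
    target: str   # what the action touches
    claims: List[dict] = field(default_factory=list)  # {"text": ..., "source": ...}


# A gate returns None to pass, or a string explaining why it blocks.
Gate = Callable[[Action], Optional[str]]

# Hypothetical blocklist; a real one would be specific to your systems.
DESTRUCTIVE_KINDS: Set[str] = {"drop_table", "delete_environment", "force_push"}


def destructive_gate(action: Action) -> Optional[str]:
    if action.kind in DESTRUCTIVE_KINDS:
        return f"destructive action '{action.kind}' requires a human"
    return None


def make_scope_gate(allowed_targets: Set[str]) -> Gate:
    """Scope control: only the targets named in the request may be touched."""
    def scope_gate(action: Action) -> Optional[str]:
        if action.target not in allowed_targets:
            return f"'{action.target}' is outside the requested scope"
        return None
    return scope_gate


def source_gate(action: Action) -> Optional[str]:
    """Anti-fabrication: every claim must name a source."""
    for claim in action.claims:
        if not claim.get("source"):
            return f"unsourced claim: {claim.get('text', '')!r}"
    return None


def check(action: Action, gates: List[Gate]) -> List[str]:
    """Collect every block reason; an empty list means the action may proceed."""
    return [reason for gate in gates if (reason := gate(action)) is not None]


def execute(action: Action, gates: List[Gate], run: Callable[[Action], object]):
    reasons = check(action, gates)
    if reasons:
        # The block lives in code, not in a prompt the model can ignore.
        raise PermissionError("; ".join(reasons))
    return run(action)
```

The point of the sketch is where the check lives: `execute` is the only path to `run`, so an action that fails any gate never executes, regardless of what the system prompt says. Requires Python 3.8 or later for the `:=` operator.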
|  | Magic Prompt | Context Infrastructure |
|---|---|---|
| Model update | Breaks, needs rewrite | Swap the engine, keep the learnings |
| Long session | Degrades, drifts | Mechanical gates hold |
| New platform | Starts from zero | Builds on captured learnings |
| Team scales | Everyone writes their own prompts | Everyone uses the same system |
| Day 200 | Same as Day 1 | 200 days of compound knowledge |

The uncomfortable truth: building AI infrastructure is boring. Config files. Memory protocols. Documentation. Capture routines. Doesn’t make a great LinkedIn carousel. But it’s the difference between an AI demo and an AI system.

Getting Started

You don’t need to build everything at once.

1. Give your AI memory. A file it reads at session start: project state, decisions, open items. Even a simple markdown file. Never start from zero.

2. Add one guardrail. Pick your AI’s most common failure mode. Build one mechanical check for it.
