AI, MCP, and the Hidden Costs of Data Hoarding – O’Reilly

Monday, December 15, 2025
The Model Context Protocol (MCP) is genuinely useful. It gives people who develop AI tools a standardized way to call functions and access data from external systems. Instead of building custom integrations for each data source, you can expose databases, APIs, and internal tools through a common protocol that any AI can understand.

However, I’ve been watching teams adopt MCP over the past year, and I’m seeing a disturbing pattern. Developers are using MCP to quickly connect their AI assistants to every data source they can find—customer databases, support tickets, internal APIs, document stores—and dumping it all into the AI’s context. And because the AI is smart enough to sort through a massive blob of data and pick out the parts that are relevant, it all just works! Which, counterintuitively, is actually a problem. The AI cheerfully processes massive amounts of data and produces reasonable answers, so nobody even thinks to question the approach.

This is data hoarding. And like physical hoarders who can’t throw anything away until their homes become so cluttered they’re unlivable, data hoarding has the potential to cause serious problems for our teams. Developers learn they can fetch far more data than the AI needs and provide it with little planning or structure, and the AI is smart enough to deal with it and still give good results.

When connecting a new data source takes hours instead of days, many developers don’t take the time to ask what data actually belongs in the context. That’s how you end up with systems that are expensive to run and impossible to debug, while an entire cohort of developers misses the chance to learn the critical data architecture skills they need to build robust and maintainable applications.

How Teams Learn to Hoard

Anthropic released MCP in late 2024 to give developers a universal way to connect AI assistants to their data. Instead of maintaining separate code for connectors to let AI access data from, say, S3, OneDrive, Jira, ServiceNow, and your internal DBs and APIs, you use the same simple protocol to provide the AI with all sorts of data to include in its context. It quickly gained traction. Companies like Block and Apollo adopted it, and teams everywhere started using it. The promise is real; in many cases, the work of connecting data sources to AI agents that used to take weeks can now take minutes. But that speed can come at a cost.

Let’s start with an example: a small team working on an AI tool that reads customer support tickets, categorizes them by urgency, suggests responses, and routes them to the right department. They needed to get something working quickly but faced a challenge: They had customer data spread across multiple systems. After spending a morning arguing about what data to pull, which fields were necessary, and how to structure the integration, one developer decided to just build it, creating a single getCustomerData(customerId) MCP tool that pulls everything they’d discussed—40 fields from three different systems—into one big response object. To the team’s relief, it worked! The AI happily consumed all 40 fields and started answering questions, and no more discussions or decisions were needed. The AI handled all the new data just fine, and everyone felt like the project was on the right track.
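A minimal sketch of what that “fetch everything” tool might look like. The three source systems, their fields, and the function names are all hypothetical stand-ins, not real APIs:

```python
# Hypothetical sketch of the "fetch everything" anti-pattern described above.
# The three source systems and their fields are illustrative, not real APIs.

def fetch_from_zendesk(customer_id: str) -> dict:
    # Stand-in for a real support-system lookup.
    return {"zd_status": "open", "zd_ticket_count": 7, "zd_last_seen": "2025-11-02"}

def fetch_from_crm(customer_id: str) -> dict:
    return {"crm_status": "active", "crm_name": "A. Customer", "crm_tier": "gold"}

def fetch_from_billing(customer_id: str) -> dict:
    return {"refund_eligible": True, "lifetime_value": 1234.56, "open_invoices": 0}

def get_customer_data(customer_id: str) -> dict:
    # The anti-pattern: merge everything from every system into one blob
    # and let the AI sort out which fields matter.
    blob = {}
    for fetch in (fetch_from_zendesk, fetch_from_crm, fetch_from_billing):
        blob.update(fetch(customer_id))
    return blob

blob = get_customer_data("cust-42")
# Only one field was actually needed to answer "Is this customer eligible
# for a refund?" -- everything else is context the model must wade through.
needed = {"refund_eligible": blob["refund_eligible"]}
```

The real tool pulled 40 fields rather than nine, but the shape of the problem is the same: one call, one blob, no decisions.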

Day two, someone added order history so the assistant could explain refunds. Soon the tool pulled Zendesk status, CRM status, eligibility flags that contradicted each other, three different name fields, four timestamps for “last seen,” plus entire conversation threads, and combined them all into an ever-growing data object.

The assistant kept producing reasonable-looking answers, even as the data it ingested kept growing. But the model now had to wade through thousands of irrelevant tokens before answering simple questions like “Is this customer eligible for a refund?” The team had built a data architecture that buried the signal in noise, forcing the AI to dig it out on every call. That was a serious long-term problem, but they didn’t notice it yet, because those reasonable-looking answers kept coming. As they added more data sources over the following weeks, the AI took longer to respond, and hallucinations crept in that they couldn’t trace back to any specific data source. What had been a genuinely valuable tool became a bear to maintain.

The team had fallen into the data hoarding trap: Their early quick wins created a culture where people just threw whatever they needed into the context, and eventually it grew into a maintenance nightmare that only got worse as they added more data sources.

The Skills That Never Develop

There are as many opinions on data architecture as there are developers, and usually many ways to solve any one problem. Almost everyone agrees that good data architecture takes careful choices and lots of experience. That’s also why it generates so much debate, especially within teams: there are countless ways to design how your application stores, transmits, encodes, and uses data.

Most of us fall into just-in-case thinking at one time or another, especially early in our careers—pulling all the data we might possibly need just in case we need it rather than fetching only what we need when we actually need it (which is an example of the opposite, just-in-time thinking). Normally when we’re designing our data architecture, we’re dealing with immediate constraints: ease of access, size, indexing, performance, network latency, and memory usage. But when we use MCP to provide data to an AI, we can often sidestep many of those trade-offs…temporarily.
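The difference between the two mindsets is easy to show in code. This sketch contrasts an eager “just in case” load with a lazy “just in time” one; `fetch_field` is a hypothetical stand-in for a call to some backing system:

```python
# Illustrative contrast between just-in-case and just-in-time fetching.
# fetch_field stands in for a network call to a backing system.

def fetch_field(customer_id: str, field: str) -> str:
    return f"{field}-for-{customer_id}"  # pretend this hits the network

# Just in case: grab every field up front, whether or not it's ever used.
def load_customer_just_in_case(customer_id: str) -> dict:
    all_fields = ["name", "status", "tier", "last_order", "notes"]
    return {f: fetch_field(customer_id, f) for f in all_fields}

# Just in time: fetch a field only at the moment something asks for it.
class LazyCustomer:
    def __init__(self, customer_id: str):
        self.customer_id = customer_id
        self._cache: dict = {}
        self.fetch_count = 0

    def get(self, field: str) -> str:
        if field not in self._cache:
            self._cache[field] = fetch_field(self.customer_id, field)
            self.fetch_count += 1
        return self._cache[field]

eager = load_customer_just_in_case("cust-42")   # five fetches, five fields held
lazy = LazyCustomer("cust-42")
lazy.get("status")                              # one fetch, one field held
```

The eager version is exactly what a blob-style MCP tool does on every request; the lazy version makes each fetch an explicit, countable decision.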

The more we work with data, the better we get at designing how our apps use it. The more early-career developers are exposed to it, the more they learn through experience why, for example, System A should own customer status while System B owns payment history. Healthy debate is an important part of this learning process. Through all of these experiences, we develop an intuition for what “too much data” looks like—and how to handle all of those tricky but critical trade-offs that create friction throughout our projects.

MCP can remove the friction that comes from those trade-offs by letting us avoid having to make those decisions at all. If a developer can wire up everything in just a few minutes, there’s no need for discussion or debate about what’s actually needed. The AI seems to handle whatever data you throw at it, so the code ships without anyone questioning the design.

Without all of that experience making, discussing, and debating data design choices, developers miss the chance to build critical mental models about data ownership, system boundaries, and the cost of moving unnecessary data around. They spend their formative years connecting instead of architecting. This is another example of what I call the cognitive shortcut paradox—AI tools that make development easier can prevent developers from building the very skills they need to use those tools effectively. Developers who rely solely on MCP to handle messy data never learn to recognize when data architecture is problematic, just like developers who rely solely on tools like Copilot or Claude Code to generate code never learn to debug what it creates.

The Hidden Costs of Data Hoarding

Teams use MCP because it works. Many teams carefully plan their MCP data architecture, and even teams that do fall into the data hoarding trap still ship successful products. But MCP is still relatively new, and the hidden costs of data hoarding take time to surface.

Teams often don’t discover the problems with a data hoarding approach until they need to scale their applications. That bloated context that barely registered as a cost for your first hundred queries starts showing up as a real line item in your cloud bill when you’re handling millions of requests. Every unnecessary field you’re passing to the AI adds up, and you’re paying for all that redundant data on every single AI call.
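Rough back-of-the-envelope arithmetic shows how quickly that line item grows. The token counts and the per-token price below are made-up round numbers, not any provider’s actual rates:

```python
# Illustrative cost estimate; token counts and pricing are hypothetical
# round numbers, not any provider's actual rates.

PRICE_PER_1K_INPUT_TOKENS = 0.003  # assumed dollars per 1,000 input tokens

def monthly_input_cost(tokens_per_request: int, requests: int) -> float:
    return tokens_per_request * requests / 1000 * PRICE_PER_1K_INPUT_TOKENS

requests_per_month = 1_000_000
hoarded = monthly_input_cost(5_000, requests_per_month)  # full 40-field blob
lean = monthly_input_cost(200, requests_per_month)       # only the needed fields

print(f"hoarded: ${hoarded:,.0f}/mo, lean: ${lean:,.0f}/mo")
# At these assumed numbers: $15,000/month for the blob vs. $600/month lean,
# a 25x difference that never showed up during the first hundred test queries.
```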

Any developer who’s dealt with tightly coupled classes knows that when something goes wrong—and it always does, eventually—it’s a lot harder to debug. You often end up dealing with shotgun surgery, that really unpleasant situation where fixing one small problem requires changes that cascade across multiple parts of your codebase. Hoarded data creates the same kind of technical debt in your AI systems: When the AI gives a wrong answer, tracking down which field it used or why it trusted one system over another is difficult, often impossible.

There’s also a security dimension to data hoarding that teams often miss. Every piece of data you expose through an MCP tool is a potential vulnerability. If an attacker finds an unprotected endpoint, they can pull everything that tool provides. If you’re hoarding data, that’s your entire customer database instead of just the three fields actually needed for the task. Teams that fall into the data hoarding trap find themselves violating the principle of least privilege: Applications should have access to the data they need, but no more. That can bring an enormous security risk to their whole organization.
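One way to enforce least privilege in an MCP tool is a per-task field allowlist, so the endpoint can only ever return what the task needs. This is a sketch; the task names and fields are hypothetical:

```python
# Sketch of a per-task field allowlist: even if an attacker reaches the
# endpoint, it can only return the fields registered for that task.
# Task names and field names are illustrative.

TASK_FIELDS = {
    "refund_check": {"customer_id", "refund_eligible", "last_order_date"},
    "routing": {"customer_id", "department", "ticket_priority"},
}

def project_for_task(record: dict, task: str) -> dict:
    allowed = TASK_FIELDS[task]  # raises KeyError for unregistered tasks
    return {k: v for k, v in record.items() if k in allowed}

full_record = {
    "customer_id": "cust-42",
    "refund_eligible": True,
    "last_order_date": "2025-11-02",
    "home_address": "123 Main St",   # never leaves the server
    "payment_token": "tok_secret",   # never leaves the server
}

safe = project_for_task(full_record, "refund_check")
```

The blob approach ships `full_record` on every call; the projection ships three fields and makes any new exposure an explicit change to `TASK_FIELDS` that a reviewer can see.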

In an extreme case of data hoarding infecting an entire company, you might discover that every team in your organization is building their own blob. Support has one version of customer data, sales has another, product has a third. The same customer looks completely different depending on which AI assistant you ask. New teams come along, see what appears to be working, and copy the pattern. Now you’ve got data hoarding as organizational culture.

Each team thought they were being pragmatic, shipping fast, and avoiding unnecessary arguments about data architecture. But the hoarding pattern spreads through an organization the same way technical debt spreads through a codebase. It starts small and manageable. Before you know it, it’s everywhere.

Practical Tools for Avoiding the Data Hoarding Trap

It can be really difficult to coach a team away from data hoarding when they’ve never experienced the problems it causes. Developers are very practical—they want to see evidence of problems and aren’t going to sit through abstract discussions about data ownership and system boundaries when everything they’ve done so far has worked just fine.

In Learning Agile, Jennifer Greene and I wrote about how teams resist change because they know that what they’re doing today works. To the person trying to get developers to change, it may seem like irrational resistance, but it’s actually pretty rational to push back against someone from the outside telling them to throw out what works today for something unproven. But just like developers eventually learn that taking time for refactoring speeds them up in the long run, teams need to learn the same lesson about deliberate data design in their MCP tools.

Here are some practices that can make those discussions easier, by starting with constraints that even skeptical developers can see the value in:

  • Build tools around verbs, not nouns. Create checkEligibility() or getRecentTickets() instead of getCustomer(). Verbs force you to think about specific actions and naturally limit scope.
  • Talk about minimizing data needs. Before anyone creates an MCP tool, discuss the smallest piece of data the AI needs to do its job, and what experiments the team can run to find out what it truly needs.
  • Break reads apart from reasoning. Separate data fetching from decision-making when you design your MCP tools. A simple findCustomerId() tool that returns just an ID uses minimal tokens—and might not even need to be an MCP tool at all, if a simple API call will do. Then getCustomerDetailsForRefund(id) pulls only the specific fields needed for that decision. This pattern keeps context focused and makes it obvious when someone’s trying to fetch everything.
  • Dashboard the waste. The best argument against data hoarding is showing the waste. Track the ratio of tokens fetched versus tokens used and display it in an “information radiator” style dashboard that everyone can see. When a tool pulls 5,000 tokens but the AI only references 200 in its answer, everyone can see the problem. Once developers see they’re paying for tokens they never use, they get very interested in fixing it.
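A sketch of how a verb-scoped tool and the waste metric might fit together. The tool name, field set, and the crude token estimate (roughly four characters per token) are all illustrative assumptions:

```python
# Illustrative verb-scoped tool plus a crude fetched-vs-used token tally.
# estimate_tokens is a stand-in for a real tokenizer (~4 chars per token).

import json

def estimate_tokens(payload: dict) -> int:
    return max(1, len(json.dumps(payload)) // 4)

WASTE_LOG: list[dict] = []  # feed this to the "information radiator" dashboard

def record_usage(tool: str, fetched: dict, used_fields: set) -> None:
    used = {k: v for k, v in fetched.items() if k in used_fields}
    WASTE_LOG.append({
        "tool": tool,
        "fetched_tokens": estimate_tokens(fetched),
        "used_tokens": estimate_tokens(used),
    })

# Verb-scoped: answers one question with a handful of fields,
# instead of a getCustomer() blob.
def check_eligibility(customer_id: str) -> dict:
    return {"customer_id": customer_id, "refund_eligible": True}

fetched = check_eligibility("cust-42")
record_usage("check_eligibility", fetched, {"refund_eligible"})

entry = WASTE_LOG[0]
ratio = entry["used_tokens"] / entry["fetched_tokens"]
# The higher the ratio, the less waste. A tool that fetches 5,000 tokens
# while the AI uses 200 would sit at 0.04 on the dashboard.
```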

Quick smell test for data hoarding

  • Tool names are nouns (getCustomer()) instead of verbs (checkEligibility()).
  • Nobody’s ever asked, “Do we really need all these fields?”
  • You can’t tell which system owns which piece of data.
  • Debugging requires detective work across multiple data sources.
  • Your team rarely or never discusses the data design of MCP tools before building them.

Looking Forward

MCP is a simple but powerful tool with enormous potential for teams. But because it can be a critically important pillar of your entire application architecture, problems you introduce at the MCP level ripple throughout your project. Small mistakes have huge consequences down the road.

The very simplicity of MCP encourages data hoarding. It’s an easy trap to fall into, even for experienced developers. But what worries me most is that developers learning with these tools right now might never learn why data hoarding is a problem, and they won’t develop the architectural judgment that comes from having to make hard choices about data boundaries. Our job, especially as leaders and senior engineers, is to help everyone avoid the data hoarding trap.

When you treat MCP decisions with the same care you give any core interface—keeping context lean, setting boundaries, revisiting them as you learn—MCP stays what it should be: a simple, reliable bridge between your AI and the systems that power it.
