OUR DATA LAKE IS THE BACKBONE OF EVERY TOOL AND AGENTIC WORKFLOW.

Rody Arantes

Director of Digital Technology

Head of Platform Engineering & IT


Every AI tool that actually works is held up by an invisible backbone: the data lake underneath it. Get the backbone right, and the tools on top of it just work. Get it wrong, and the best model in the world ships as a proof of concept (POC) that fails in production the first time it meets real data. That is the engineering work most teams underinvest in, and it is where real AI enablement begins.

Our team has built a platform that gives scientists more context, less friction, and more time for the decisions only they can make. The scientist is the driver. The platform is the instrument. Every tool layered on top of that platform exists to support that relationship, not to replace it. The work that makes the platform real is mostly data work, across years, and the layers that matter most are almost never the ones a scientist sees.

THE DATA LAKE UNDERNEATH

A well-designed and well-documented data lake is the backbone of everything we build. Our petabyte-scale data lake moves experimental data through ingestion, quality control, processing, and analysis. Lab automation tools support the experiments that produce the data. The compute platform that generates chemical and molecular properties at scale feeds the teams who train discovery models. An orchestration layer runs the jobs and tasks that keep all of it moving. By the time the data reaches an interface a scientist actually touches, every step has been validated and its path is fully traceable.
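To make "validated at every step, with a traceable path" concrete, here is a minimal sketch in Python. It is not our production code. The Record type, the stage functions, and the field names (assay_id, value) are hypothetical and exist only to illustrate the idea: each stage either rejects the data or passes it on, and the record carries a list of every stage it has been through.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical unit of experimental data plus the path it has taken."""
    payload: dict
    lineage: list[str] = field(default_factory=list)


def ingest(payload: dict) -> dict:
    # Copy the raw input so later stages never mutate the source.
    return dict(payload)


def quality_control(payload: dict) -> dict:
    # Reject records that are missing the measurement (illustrative rule only).
    if payload.get("value") is None:
        raise ValueError("missing measurement value")
    return payload


def process(payload: dict) -> dict:
    # Example transformation: add a normalized copy of the value.
    return {**payload, "value_normalized": payload["value"] / 100.0}


# Stage names follow the article: ingestion, quality control, processing.
STAGES = [
    ("ingestion", ingest),
    ("quality_control", quality_control),
    ("processing", process),
]


def run_pipeline(payload: dict) -> Record:
    """Run every stage in order and log each completed stage in the lineage."""
    record = Record(payload=payload)
    for name, stage in STAGES:
        record.payload = stage(record.payload)
        record.lineage.append(name)
    return record


record = run_pipeline({"assay_id": "A-001", "value": 42.0})
print(record.lineage)
```

A record that fails quality control raises an error and never reaches processing, so anything a downstream interface sees has a complete lineage behind it. A real platform would persist that lineage in a catalog rather than on the object itself.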

Every consumer is also a producer.

The diagram above shows the simplest truth about what makes the data lake worth its cost. Every team that consumes data from the data lake also produces data back into it. Biology generates assay results and metadata. Computational modeling generates predictions and model metrics. Data research generates enriched datasets and decision-enabling reports. The data lake is the place all of that meets, gets indexed, gets connected, and becomes more useful to the next consumer. The more the platform is used, the more valuable it becomes.

This is the layer of AI enablement that rarely gets talked about. The tools at the top, whether they are the analysis interfaces that have been running for years or the agents we are adding now, only work if the information they draw on exists in a form they can read. Building that form, at the scale of modern drug discovery, is where most of the engineering effort actually goes. The tool is the last mile. The platform is the highway.

Our platform makes smart people faster at the decisions only they can make. We do not tell brilliant scientists with decades of experience how to think.

WHAT HAPPENS WITHOUT IT

Teams that skip the foundation work almost always discover it later, and the discovery is painful. They ship a tool, the tool works in the POC, and then it fails in production because the real data is inconsistent, incomplete, or impossible to trace. The team then spends the next six months doing foundation work under pressure, while the tool they built sits unused. Every user who tried it and got bad output has already lost confidence, and that is not easy to win back. The fastest path to a working AI system is the one that looks slowest. Build the foundation first.

THE SELF-AUDIT

Here is the test. Look at your current data lake and ask yourself whether it is FAIR (Findable, Accessible, Interoperable, Reusable). These principles have been the research data community's standard for over a decade, and they remain the sharpest diagnostic for whether a data lake is actually ready to carry anything built on top of it.

Can a new team member find the data they need without asking anyone? Can they retrieve it without writing a custom script? Does it work with the tools and standards they already use? Can they trust its provenance and reuse it with confidence?
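Those four questions can be turned into a checklist you run against every dataset. The sketch below is one possible shape for that, written in Python. The DatasetEntry fields are assumptions chosen for illustration, not part of the FAIR principles themselves, and each team will need to decide what evidence counts as a "yes" for its own data.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """Hypothetical catalog entry; each flag answers one audit question."""
    name: str
    in_catalog: bool            # Findable: discoverable without asking anyone
    has_standard_access: bool   # Accessible: retrievable without a custom script
    uses_shared_format: bool    # Interoperable: works with existing tools and standards
    has_provenance: bool        # Reusable: origin and processing history are recorded


def fair_gaps(entry: DatasetEntry) -> list[str]:
    """Return the FAIR principles this dataset does not yet satisfy."""
    checks = {
        "Findable": entry.in_catalog,
        "Accessible": entry.has_standard_access,
        "Interoperable": entry.uses_shared_format,
        "Reusable": entry.has_provenance,
    }
    return [principle for principle, ok in checks.items() if not ok]


assay = DatasetEntry(
    name="example_assay_results",
    in_catalog=True,
    has_standard_access=True,
    uses_shared_format=False,
    has_provenance=True,
)
print(fair_gaps(assay))
```

An empty list means the dataset passes all four questions. Anything else is a named gap to close before building on top of it.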

If any of those answers takes more than a few seconds to produce, your data is not yet FAIR. The problem is not the tool you are about to build or the model you picked. The problem is the ground that work is about to stand on. Fix the ground first. It has to take the hits and keep delivering. The rest takes care of itself.