Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps

107 points by jeffreyip a day ago

Hi HN - we're Jeffrey and Kritin, and we're building Confident AI (https://confident-ai.com). This is the cloud platform for DeepEval (https://github.com/confident-ai/deepeval), our open-source package that helps engineers evaluate and unit-test LLM applications. Think Pytest for LLMs.
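
To give a concrete sense of the workflow, a minimal DeepEval test looks roughly like this (a simplified sketch; my_llm_app stands in for your own application, and see the quickstart for the exact API):

    # test_chatbot.py (run with: deepeval test run test_chatbot.py)
    from deepeval import assert_test
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    def test_refund_policy_answer():
        query = "What are your refund terms?"
        test_case = LLMTestCase(
            input=query,
            actual_output=my_llm_app(query),  # placeholder for your app's output
            retrieval_context=["Customers can get a full refund within 30 days."],
        )
        # fails the test (and your CI run) if answer relevancy drops below the threshold
        assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])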

We spent the past year building DeepEval with the goal of providing the best LLM evaluation developer experience, growing it to run over 600K evaluations daily in CI/CD pipelines of enterprises like BCG, AstraZeneca, AXA, and Capgemini. But the fact that DeepEval simply runs, and does nothing with the data afterward, isn’t the best experience. If you want to inspect failing test cases, identify regressions, or even pick the best model/prompt combination, you need more than just DeepEval. That’s why we built a platform around it.

Here’s a quick demo video of how everything works: https://youtu.be/PB3ngq7x4ko

Confident AI is great for RAG pipelines, agents, and chatbots. Typical use cases involve allowing companies to switch the underlying LLM, rewrite prompts for newer (and possibly cheaper) models, and keep test sets in sync with the codebase where DeepEval tests are run.

Our platform features a "dataset editor," a "regression catcher," and "iteration insights". The dataset editor lets domain experts edit datasets while keeping them in sync with the codebase where your DeepEval tests run. Once DeepEval finishes running evaluations on these cloud-pulled datasets, we generate shareable LLM testing/benchmark reports. The regression catcher then flags any regressions in your new implementation, and we use the evaluation results to determine your best iteration based on metric scores.
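
In code, the sync looks roughly like this: pull the goldens by alias at evaluation time, run them through your current implementation, and evaluate (a simplified sketch; my_llm_app and the alias are placeholders, and attribute names may differ slightly from the docs):

    from deepeval import evaluate
    from deepeval.dataset import EvaluationDataset
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    dataset = EvaluationDataset()
    dataset.pull(alias="Customer Support Goldens")  # fetches the latest expert edits from the cloud

    # turn each golden into a test case using your current prompt/model combination
    test_cases = [
        LLMTestCase(
            input=golden.input,
            actual_output=my_llm_app(golden.input),  # placeholder for your app
            expected_output=golden.expected_output,
        )
        for golden in dataset.goldens
    ]
    evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])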

Our goal is to make benchmarking LLM applications so reliable that picking the best implementation is as simple as reading the metric values off the dashboard. To achieve this, the quality of curated datasets and the accuracy and reliability of metrics must be the highest possible.

This brings us to our current limitations. Right now, DeepEval’s primary evaluation method is LLM-as-a-judge. We use techniques such as GEval and question-answer generation to improve reliability, but these methods can still be inconsistent. Even with high-quality datasets curated by domain experts, our evaluation metrics remain the biggest blocker to our goal.

To address this, we recently released a DAG (Directed Acyclic Graph) metric in DeepEval. It is a decision-tree-based, LLM-as-a-judge metric that provides deterministic results by breaking a test case into finer atomic units. Each edge represents a decision, each node represents an LLM evaluation step, and each leaf node returns a score. It works best in scenarios where success criteria are clearly defined, such as text summarization.
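
Conceptually (a sketch of the idea, not the actual DAG API), a summarization DAG behaves like a tree of narrow judge calls, where the path taken through the tree, rather than a single holistic judgement, determines the score:

    # judge() is any function that asks an LLM a yes/no question and returns a bool
    def summarization_dag_score(judge, source: str, summary: str) -> float:
        if not judge(
            f"Does the summary contain only facts stated in the source?\n"
            f"Source: {source}\nSummary: {summary}"
        ):
            return 0.0  # hallucinated content: fail immediately
        if not judge(
            f"Does the summary cover the source's main conclusion?\n"
            f"Source: {source}\nSummary: {summary}"
        ):
            return 0.5  # faithful but incomplete
        return 1.0  # faithful and complete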

The DAG metric is still in its early stages, but our hope is that by moving towards better, code-driven, open-source metrics, Confident AI can deliver deterministic LLM benchmarks that anyone can blindly trust.

We hope you’ll give Confident AI a try. Quickstart here: https://docs.confident-ai.com/confident-ai/confident-ai-intr...

The platform runs on a freemium tier, and we've dropped the need to sign up with a work email for the next four days.

Looking forward to your thoughts!

codelion 6 hours ago

The DAG feature for subjective metrics sounds really promising. I've been struggling with the same "good email" problem. Most of the existing benchmarks are too rigid for nuanced evaluations like that. Looking forward to seeing how that part of DeepEval evolves.

nisten a day ago

This looks nice and flashy for an investor presentation, but practically I just need the thing to work off of an API or, if it's all local, to at least have vLLM support so it doesn't take 10 hours to run a bench.

The extra-long documentation and abstractions are, for me personally, exactly what I DON'T want to have in a benchmarking repo. E.g.: what transformers version is this, will it support TGI v3, will it automatically remove thinking traces with a flag in the code or run command, will it run the latest models that need a custom transformers version, etc.

And if it's not a locally runnable product, it should at least have a publicly accessible leaderboard to submit OSS models to, or something.

Just my opinion. I don't like it. It looks like way too much docs and code slop for what should just be a 3 line command.

  • jeffreyip a day ago

    I see. Although most users come to us for evaluating LLM applications, you're correct that academic benchmarking of foundational models is also offered in DeepEval, which I'm assuming is what you're talking about.

    We actually designed it to work easily off any API. How it works is you just create a wrapper around your API and you're good to go. We take care of the async/concurrent handling of the benchmarking, so evaluation speed is really just limited by the rate limit of your LLM API.

    This link shows what a wrapper looks like: https://docs.confident-ai.com/guides/guides-using-custom-llm...

    And once you have your model wrapper set up, you can use any benchmark we provide.
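
    A wrapper is essentially a small subclass of DeepEvalBaseLLM that forwards prompts to your endpoint. Something like this simplified sketch, where call_my_api is a placeholder for your own HTTP call:

        from deepeval.models import DeepEvalBaseLLM

        class MyAPIModel(DeepEvalBaseLLM):
            def load_model(self):
                return None  # nothing to load locally, we just call a remote API

            def generate(self, prompt: str) -> str:
                return call_my_api(prompt)  # placeholder for your HTTP call

            async def a_generate(self, prompt: str) -> str:
                return self.generate(prompt)  # or use an async client for real concurrency

            def get_model_name(self) -> str:
                return "my-hosted-model"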

llm_trw 19 hours ago

>This brings us to our current limitations. Right now, DeepEval’s primary evaluation method is LLM-as-a-judge. We use techniques such as GEval and question-answer generation to improve reliability, but these methods can still be inconsistent. Even with high-quality datasets curated by domain experts, our evaluation metrics remain the biggest blocker to our goal.

Have you done any work on dynamic data generation?

I've found that even taking a public benchmark and remixing the order of questions had a deep impact on model performance - ranging from catastrophic for tiny models to problematic for larger models once you get past their effective internal working memory.

  • jeffreyip 18 hours ago

    Interesting, how are you remixing the order of questions? If we're talking about an academic benchmark like MMLU, the questions are independent of one another. Unless you're generating multiple answers in one go?

    We do do synthetic data generation for custom application use cases, such as RAG, summarization, text-to-SQL, etc. We call this module the "synthesizer", and you can customize your data generation pipeline however you want (I think, let me know otherwise!).

    Docs for the synthesizer are here: https://docs.confident-ai.com/docs/synthesizer-introduction. There's a nice "how does it work" section at the bottom explaining it more.
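
    As a rough sketch (method and parameter names simplified; the docs above have the exact API), generating goldens from your own documents looks like:

        from deepeval.synthesizer import Synthesizer

        synthesizer = Synthesizer()
        # chunks the documents, selects contexts, and generates input/expected-output goldens
        goldens = synthesizer.generate_goldens_from_docs(
            document_paths=["knowledge_base.pdf", "faq.md"],
        )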

    • llm_trw 18 hours ago

      >Interesting, how are you remixing the order of questions? If we're talking about an academic benchmark like MMLU, the questions are independent of one another. Unless you're generating multiple answers in one go?

      Short version: if a model can answer a very high proportion of questions from a benchmark accurately then the next step is to ask it two or more questions at a time. On some models the quality of answers varies dramatically with which is asked first.

      >Docs for the synthesizer are here: https://docs.confident-ai.com/docs/synthesizer-introduction. There's a nice "how does it work" section at the bottom explaining it more.

      Very good start, but the statistics of the generated text matter a lot.

      As an example, on a dumb-as-bricks benchmark I've designed, I can saturate the reasoning capabilities of all non-reasoning models just by varying the names of objects in the questions. A model that could get a normalized score of 14 with standard object strings could score as high as 18 with one-letter strings standing for objects, and as low as zero with arbitrary UTF-8 character strings, which turned out to matter a lot since all the data was polluted with international text coming from the stock exchanges.

      Feel free to drop me a line if you're interested in a more in depth conversation. LLMs are _ridiculously_ under tested for how many places they show up in.

      • jeffreyip 16 hours ago

        Hey yes would definitely love to, my contact info is in my bio, please drop me an email :)

stereobit 10 hours ago

DAG sounds interesting. Might help me to solve my biggest challenge with evals right now, which is testing subjective metrics e.g. “is this a good email”

tracyhenry a day ago

This looks great. I would love to know more about what makes Confident AI/DeepEval special compared to the tons of other LLM eval tools out there.

  • jeffreyip a day ago

    Thanks and great question! There are a ton of eval tools out there, but only a few that actually focus on evals. The quality of LLM evaluation depends on the quality of the dataset and the quality of the metrics, so tools that are more focused on the platform side of things (observability/tracing) tend to fall short on accurate and reliable benchmarking. What tends to happen with those tools is that users use them for one-off debugging, but when errors only happen 1% of the time, there's no capability for regression testing.

    Since we own the metrics and the algorithms that we've spent the last year iterating on with our users, we can balance giving engineers the ability to customize our metric algorithms and evaluation techniques with offering them the ability to bring it all to the cloud for their organization when they're ready.

    This brings me to the tools that do have their own metrics and evals. Including us, there are only three companies out there that do this to a good extent (excuse me for this one), and we're the only one with a self-serve platform, so any open-source user can get the benefit of Confident AI as well.

    That's not all the difference, because if you were to compare DeepEval's metrics on more nuanced details (which I think is very important), we provide the most customizable metrics out there. This includes the research-backed, SOTA LLM-as-a-judge G-Eval for any criteria, and the recently released DAG metric, which is decision-based and virtually deterministic despite being LLM-evaluated. This means that as users' use cases get more and more specific, they can stick with our metrics and benefit from DeepEval's ecosystem as well (metric caching, cost tracking, parallelization, Pytest integration for CI/CD, Confident AI, etc.).

    There's so much more, such as generating synthetic data to get started with testing even if you don't have a prepared test set, and red-teaming for safety testing (so not just testing for functionality), but I'm going to stop here for now.
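
    For a flavor of the customization, a G-Eval metric for an arbitrary criterion looks roughly like this (simplified sketch; see the docs for the exact signature):

        from deepeval.metrics import GEval
        from deepeval.test_case import LLMTestCaseParams

        email_quality = GEval(
            name="Email Quality",
            criteria="Is the email clear and polite, and does it address the sender's request?",
            evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        )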

jchiu220 21 hours ago

This is an awesome tool! Been using it since day 1 and will keep using it. Would recommend to anyone looking for an LLM Eval tool

TeeWEE a day ago

Was also looking at Langfuse.ai or braintrust.dev

Anybody with experience can give me a tip on the best way to:
- evaluate
- manage prompts
- trace calls

fullstackchris a day ago

Congrats guys! Back in the spring of last year I did an initial spike investigating tools that could evaluate the accuracy of responses in our RAG queries where I work. We used your services (tests and test dashboard) as a little demo.

  • jeffreyip a day ago

    That's great! Hope you enjoyed it :)

avipeltz 19 hours ago

this is sick, all star founders making big moves ;)
