Introducing DSPy Visualizer

Rogerio - CTO

Jun 3, 2024

DSPy automatically improves your prompts, allowing you to focus on the structure and expectations of your app, and to switch models with ease. We found it to be far more effective than experienced human prompt engineers, reaching higher accuracy in a much shorter timeframe.

If you are building with LLMs and have done your fair share of prompt engineering, LLM call after LLM call, trying to improve your product's performance, DSPy is the logical next step for you.

What is DSPy?

DSPy is the hot new framework in the LLM community, and it takes a different approach from what has been done over the past year: instead of having engineers manually tweak prompts to improve results and *vibe-check* the outputs, DSPy gets you to focus on the dataset examples you bring in and the objective metrics that come out, not unlike traditional Machine Learning.

DSPy takes a lot of inspiration from PyTorch, bringing a much more rational and structured approach to optimizing LLM pipelines, in a coherent framework where all the pieces fit well together.

In short, by feeding your dataset to your DSPy program and defining a metric that evaluates the outputs and tells a good result from a bad one, DSPy Optimizers can automatically search for the best combination to improve your LLM program, be it by changing the prompts, the demonstrations used for Few-Shot Prompting, or making slight changes to the structure (signatures).
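
To make that concrete, here is a minimal sketch of what a DSPy program, metric, and optimizer look like together. The signature, the tiny dataset, and the naive containment metric are illustrative assumptions, not our actual pipeline:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Configure the LLM used by the program (model choice is just an example)
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

# A signature declares the structure: the inputs and expected outputs
class AnswerQuestion(dspy.Signature):
    """Answer the question concisely."""
    question = dspy.InputField()
    answer = dspy.OutputField()

# A module composes signatures into an LLM program
class QA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(AnswerQuestion)

    def forward(self, question):
        return self.generate(question=question)

# The metric defines what counts as a good result (a naive containment check here)
def metric(example, prediction, trace=None):
    return example.answer.lower() in prediction.answer.lower()

# A hypothetical dataset of examples you bring
trainset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
]

# The optimizer searches for prompts and few-shot demonstrations that maximize the metric
optimizer = BootstrapFewShot(metric=metric)
compiled_qa = optimizer.compile(QA(), trainset=trainset)
```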

We found this approach to be very effective. In one case where we couldn’t get accuracy higher than 60% after many hours of manual prompt engineering, DSPy took it to 80% for us in a matter of minutes.

Prompt engineering, automated at last, and the results are measurable by definition!

DSPy Visualizer

At LangWatch we have been very excited about the results we are getting from DSPy, so today we are making public a tool we have already been using internally, one that helps us a lot with our DSPy training sessions: the DSPy Visualizer.

Screenshot of LangWatch DSPy Visualizer

The DSPy Visualizer allows you to log your DSPy training sessions, track performance and costs, compare runs, and debug them in detail.
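
Hooking a training session into the visualizer is done before compiling. The sketch below reuses the program, metric, and trainset from the example above, and assumes the `langwatch` Python package with its `langwatch.dspy.init` helper as described in our quickstart docs; the experiment name is a placeholder:

```python
import langwatch
from dspy.teleprompt import BootstrapFewShot

# Authenticate with LangWatch (uses your API key)
langwatch.login()

optimizer = BootstrapFewShot(metric=metric)

# Start a tracked experiment: optimizer steps, scores, and LLM calls made
# during compilation get logged to the DSPy Visualizer under this name
langwatch.dspy.init(experiment="qa-prompt-optimization", optimizer=optimizer)

compiled_qa = optimizer.compile(QA(), trainset=trainset)
```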

The visualizer makes it very easy to understand the steps the optimizer is going through. Using it ourselves, we noticed issues in our pipeline multiple times, like signatures not being used correctly or the wrong LLM being called, for example when we wanted to use GPT-4 as a teacher for a GPT-3.5 student, and it also gave us new ideas on what to try. This allowed us to stop the experiment earlier and iterate on our setup more quickly.
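
The teacher/student setup in particular is easy to get wrong, which is why having every LLM call visible helps. As a reference, a minimal sketch of how that is usually wired up in DSPy, reusing the names from the first example and with the model choices as assumptions, looks like this:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# The student is the model the compiled program will actually run on
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

# The teacher is the stronger model used to bootstrap the demonstrations
teacher_lm = dspy.OpenAI(model="gpt-4")

optimizer = BootstrapFewShot(
    metric=metric,
    teacher_settings=dict(lm=teacher_lm),
)
compiled_qa = optimizer.compile(QA(), trainset=trainset)
```

In the visualizer you would then expect to see GPT-4 calls only during bootstrapping and GPT-3.5 everywhere else; anything different is exactly the kind of misconfiguration we caught early.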

Since DSPy runs a lot of trials in a short amount of time, token rate limits and the cost of LLM calls also become a big factor. LangWatch's DSPy Visualizer tracks all LLM calls for you, making it easy to compare, for example, the cost of using GPT-3.5 Turbo versus Claude 3 Haiku, or to notice that you are hitting 400K tokens per minute and know what rate limit increase to request, as was our case.

If you are curious to try it out too, check out our quickstart docs, or book a demo with us.
