Tag: AI

  • I’m sure this is not my idea, so I’m not claiming it to be. I’ve been wanting to do a sort of continuous AI eval in production for a while, but the situation never presented itself at work. It was a mixture of already having the data to do the eval offline, and wanting to avoid the risks of doing it in prod. But now I’m going to do it for a side project.

    I don’t want to reveal what my side project is yet, so I’ll keep it vague. I’m very excited about this part, so I wanted to share it early. And I’m hoping that the Internet will tell me, as it usually does, if this is a bad idea.

    I have a task that will be done by an AI, and I can measure how successfully it was done, but only 2 to 7 days after the task was completed and its result has been out there, in the world. I will gather some successful examples to use as part of the prompt, but I don’t have a good way to measure the AI’s output other than my personal vibes, which is not good enough.

    My plan is to use OpenRouter and run most models in parallel, each doing a portion of the tasks (there are a lot of instances of these tasks). So if I go with 10 models, each model would be doing 10% of the tasks.
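
    As a rough sketch of that even split, assuming OpenRouter’s OpenAI-compatible endpoint (the model slugs, the key, and the run_task helper are illustrative placeholders, not the actual setup):

    ```python
    import random

    from openai import OpenAI

    # OpenRouter exposes an OpenAI-compatible API, so the standard client works
    # when pointed at its base_url. Key and model list here are placeholders.
    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_API_KEY",
    )

    MODELS = [
        "anthropic/claude-3.5-sonnet",
        "openai/gpt-4o",
        "google/gemini-pro-1.5",
        # ... up to ~10 models, each starting with an equal share
    ]

    def run_task(task_prompt: str) -> tuple[str, str]:
        """Pick a model uniformly at random and run one task instance on it."""
        model = random.choice(MODELS)  # even split: ~1/len(MODELS) of the tasks each
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": task_prompt}],
        )
        return model, response.choices[0].message.content
    ```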

    After a while I’m going to calculate the score of each model and then assign the proportion of tasks according to that score, so the better-scoring models will take most of the tasks. I’m going to let the system operate like that for a period of time and then recalculate the scores.
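
    A minimal sketch of that reallocation step, assuming each completed task eventually comes back as a (model, score) pair with a non-negative score:

    ```python
    from collections import defaultdict

    def model_shares(scored_tasks: list[tuple[str, float]]) -> dict[str, float]:
        """Turn per-task scores into each model's share of upcoming tasks.

        scored_tasks holds (model, score) pairs for tasks whose 2-7 day
        outcome is already known; scores are assumed to be non-negative.
        """
        by_model: dict[str, list[float]] = defaultdict(list)
        for model, score in scored_tasks:
            by_model[model].append(score)

        averages = {m: sum(s) / len(s) for m, s in by_model.items()}
        total = sum(averages.values())
        if total == 0:
            # Nothing has scored yet: fall back to an even split.
            return {m: 1.0 / len(averages) for m in averages}
        # Each model's share of future tasks is proportional to its average score.
        return {m: avg / total for m, avg in averages.items()}
    ```

    New tasks would then be drawn with random.choices(list(shares), weights=list(shares.values())) instead of the uniform random.choice above.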

    After I see it become stable, I’m going to make it continuous, so that day by day (hour by hour?), the models are selected according to their performance.

    Why not just select the winning model? This task I’m performing benefits from diversity, so if there are two or more models maxing it out, I want to distribute the tasks.

    But also, I want to add new models, maybe even automatically, as they are released. I don’t want to have to come back and re-do an eval. The continuous eval should keep me on top of new releases. This will mean reserving a fixed percentage of tasks for models with no wins yet.
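
    One way to express that is an exploration floor: every model, including a brand-new one with zero wins, keeps a small fixed share, and only the remainder is split by score. A sketch, where the 2% floor is an arbitrary placeholder:

    ```python
    def shares_with_floor(averages: dict[str, float], floor: float = 0.02) -> dict[str, float]:
        """Guarantee every model at least `floor` of the traffic; split the rest by score.

        averages maps model -> average score; a newly added model simply starts at 0.0.
        """
        assert floor * len(averages) < 1.0, "floor too high for this many models"
        remainder = 1.0 - floor * len(averages)
        total = sum(averages.values())
        if total == 0:
            return {m: 1.0 / len(averages) for m in averages}
        return {m: floor + remainder * (avg / total) for m, avg in averages.items()}
    ```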

    What about prompts? I will do the same with prompts. Having a diversity of prompts is also useful, but having high-performing prompts is the priority. This will allow me to throw prompts into the arena and see how they perform. My ideal would be running every prompt on every model. I think here I will have to watch out for the number of combinations making it take too long to get statistically significant data about each combination’s score.
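
    To make the combinatorics concrete: 10 models × 10 prompts is 100 (model, prompt) arms, and if each arm needs on the order of 30 scored tasks before its average means much (a guess, not a statistical result), that’s 3,000 tasks, each waiting 2 to 7 days for its score. A back-of-the-envelope helper:

    ```python
    from itertools import product

    def arms(models: list[str], prompts: list[str]) -> list[tuple[str, str]]:
        """Every (model, prompt) combination becomes its own arm to score."""
        return list(product(models, prompts))

    def scored_tasks_needed(n_models: int, n_prompts: int, samples_per_arm: int = 30) -> int:
        """Rough count of scored tasks before every combination has a usable average."""
        return n_models * n_prompts * samples_per_arm

    # scored_tasks_needed(10, 10) == 3000
    ```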

    What about cost? Good question! I’m still not sure whether cost should affect the score, as a sort of multiplier, or whether there’s a cut-off cost and any model that exceeds it just gets disqualified. At the moment, since I’m in the can-AI-even-do-this phase, I’m going to ignore cost.
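
    For the record, the two options I’m weighing would look roughly like this (both the multiplier shape and the cut-off value are made-up placeholders):

    ```python
    def cost_as_multiplier(score: float, cost_usd: float, penalty: float = 1.0) -> float:
        """Option 1: fold cost into the score, so cheaper models rank relatively higher."""
        return score / (1.0 + penalty * cost_usd)

    def cost_as_cutoff(score: float, cost_usd: float, max_cost_usd: float = 0.05) -> float:
        """Option 2: hard disqualification above a per-task budget; otherwise the score is untouched."""
        return 0.0 if cost_usd > max_cost_usd else score
    ```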

  • Bring your own keys into an affiliate relationship

    I’m starting to observe a problem as more and more LLM-enhanced apps pop up. For coding you have Cursor, but now there’s also a terminal called Warp, and it costs $15/month. For individuals, consultants, and small or even medium-sized companies, this isn’t a workable pricing model. All apps were already turning into subscriptions, and the cost of LLMs is accelerating that.

    What compounds the problem is that, because everyone feels uncomfortable with the potential surprise of a high pay-for-what-you-use bill, many of these apps charge a single monthly fee. A simple flat fee, except that the cost of the LLMs is not flat: it grows linearly with usage. That means throttling the LLM usage to stay within the margin of the flat fee and have a chance at a profit, which in turn means using cheaper models, which yield worse results on average.

    I think this is why, when I compare Cursor to Claude Code, I find Claude Code to be better: I’m giving Anthropic a lot more money than I give Cursor. But I’m also happy with that, because I can use Claude in many other ways, whereas Cursor is a single-purpose application.

    I think from now on, for each LLM-powered application, I either want to be able to put my own keys in, or have pay-as-you-go with a lot of transparency. When I need it to work, I want it to work well. When I’m not using it, I want it to cost nothing.

    There’s another solution though. LLM providers could have affiliate systems where other companies get a commission for the token usage they generate. Using Warp as an example, Warp wouldn’t ask me for $15/month. Warp would ask me for my Claude API key. Warp would identify all those requests as caused by Warp, and then Anthropic would pay Warp a proportional fee for the usage generated.

    This is a win-win-win: Anthropic gets more token usage (more customer expansion), Warp gets more customers (I’m not paying $15/month, but I would plug in my key), and the user gets another LLM tool that they otherwise would not have.