Tag: idea

Constant Eval in Production, the idea

December 13, 2025

Technical

AI, artificial intelligence, idea, llm

I’m sure this is not my idea, so I’m not claiming it to be. I’ve been wanting to do a sort of continuous AI eval in production for a while, but the situation never presented itself at work. It was a mixture of having the data to do the eval offline, and wanting to avoid the risks of doing it in prod. But now I’m going to do it for a side project.

I don’t want to reveal what my side project is yet, so I’ll keep it vague. I’m very excited about this part, so I wanted to share it early. And I’m hoping that the Internet will tell me, as it usually does, if this is a bad idea.

I have a task that will be done by an AI, and I can measure how successfully it was done, but only 2 to 7 days after the task is completed, once I see the result out there in the world. I will gather some successful examples to use as part of the prompt, but I don’t have a good way to measure the AI’s output other than my personal vibes, which are not good enough.

My plan is to use OpenRouter and use most models in parallel, each doing a portion of the tasks (there are a lot of instances of these tasks). So if I go with 10 models, each model would be doing 10% of the tasks.

After a while I’m going to calculate the score of each model and then assign the proportion of tasks according to that score. So the better scoring models will take most of the tasks. I’m going to let the system operate like that for a period of time and recalculate scores.

After I see it become stable, I’m going to make it continuous, so that day by day (hour by hour?), the models are selected according to their performance.

Why not just select the winning model? This task I’m performing benefits from diversity, so if there are two or more models maxing it out, I want to distribute the tasks.

But also, I want to add, maybe even automatically, new models as they are released. I don’t want to have to come back to re-do an eval. The continuous eval should keep me on top of new releases. This will mean a fixed percentage for models with no wins.
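
The allocation described above could be sketched roughly like this. It is only an illustration under assumptions: scores are non-negative numbers, the share reserved for every model (`floor`, 5% here) is an arbitrary value, and the method names `allocation` and `pick_model` are made up for the example. Nothing here talks to OpenRouter.

```ruby
# Hypothetical sketch: split tasks among models in proportion to their scores,
# while reserving a fixed minimum share for every model, including new ones
# that have no wins yet. The 5% floor is an assumed value.
def allocation(scores, floor: 0.05)
  raise ArgumentError, "floor too large for #{scores.size} models" if floor * scores.size > 1

  remaining = 1.0 - floor * scores.size # share left to hand out by score
  total = scores.values.sum.to_f

  scores.transform_values do |score|
    # With no scores at all yet, fall back to an even split.
    proportional = total.zero? ? 1.0 / scores.size : score / total
    floor + remaining * proportional
  end
end

# Pick the model for the next task by sampling from the shares.
def pick_model(shares, rng: Random.new)
  target = rng.rand
  cumulative = 0.0
  shares.each do |model, share|
    cumulative += share
    return model if target < cumulative
  end
  shares.keys.last # guards against floating point rounding
end

shares = allocation({ "model-a" => 8.0, "model-b" => 2.0, "new-model" => 0.0 })
next_model = pick_model(shares)
```

Sampling per task, instead of handing out fixed batches, keeps every model in play even when its share is small, which fits the wish for diversity.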

What about prompts? I will also do the same with prompts. Having a diversity of prompts is also useful, but having high performing prompts is the priority. This will allow me to throw prompts into the arena and see them perform. My ideal would be all prompts in all models. I think here I will have to watch out for the number of combinations making it take too long to get statistically significant data about each combination’s score.

What about cost? Good question! I’m still not sure if cost affects the score, as a sort of multiplier, or whether there’s a cut-off cost and if a model exceeds it, it just gets disqualified. At the moment, since I’m in the can-AI-even-do-this phase, I’m going to ignore cost.
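
To make the two options concrete, here is a small sketch of each. The method names, the `reference_cost` and `max_cost` parameters, and the idea of measuring cost per task are all assumptions for illustration, not a decision.

```ruby
# Hypothetical sketch of the two ways cost could enter the score.
# Costs are assumed to be positive and in the same unit (e.g. dollars per task).

# Option 1: cost as a multiplier. A model that costs the reference amount
# keeps its score; one that costs twice as much has its score halved.
def cost_weighted_score(score, cost, reference_cost:)
  score * (reference_cost / cost.to_f)
end

# Option 2: hard cut-off. Models above the limit are disqualified (score 0).
def cost_capped_score(score, cost, max_cost:)
  cost > max_cost ? 0.0 : score.to_f
end
```

The multiplier keeps expensive models in the race if they are good enough to justify the price, while the cut-off is simpler to reason about.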

An idea for a cinema company

June 22, 2012

Personal

app, application, Business, cinema, doodle, idea, theater

I doubt any cinemas are going to implement this because, like airlines and banks, they seem to be very bad at making software. Nothing surprising there.

A few months ago I was searching for a room in London. There are about 4 big sites to do that, so I posted ads on all of them, and searched on all of them. Only one provided a web application that allowed me to see whether I had already contacted someone or marked a flat as not suitable. It made searching so much easier that soon I was using that and only that site. The ads weren’t better per se, but the software was.

I like going to the movies with friends but I dread having to organize it. It’s such a pain because you have to balance the available time of each person, the timetable of the cinema, the shows in which there are still good seats, the fact that the seats might become unavailable, and handling the money (I tend to pick the cool but expensive theaters).

If I were in charge of a cinema, I would make a built-in Doodle. Doodle is an awesome application that helps you organize an event. You select all the desirable dates and times, invite the people, and they respond yes or no to each slot. At the end you pick one and go for it. I thought of setting up a Doodle to organize going to see The Dark Knight, but I ended up just picking a date and time that was convenient for me and inviting people. It didn’t work.

The built-in Doodle could work like this:
1. I go to the cinema’s website.
2. I buy my ticket.
3. I pick all the shows I can go to.
4. Set a deadline (maybe, optional).
5. I send the invite to all the people that might want to join me.
Notice that I paid for my ticket before picking the date and time. I’m not sure whether that’s a good idea. I would be okay with that, but maybe not everybody would. What do you think?

Then each person that I invited goes to the website and:
1. Looks at all the dates and times I and others picked.
2. Buys their ticket or tickets.
3. Picks the dates and times they can make.
Once everybody is in or I’m done waiting, I pick a day and time and I get all the seats assigned together in one action (even though the action of committing to the movie was individual and asynchronous). Those who can’t make the chosen show or who changed their mind get their money back and/or the option to arrange the same movie, another day, with some of the same group and/or adding other people to it.
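
The bookkeeping behind those steps is small. A rough sketch, assuming availability is stored as a map from each person to the shows they picked; the method names and the show identifiers are invented for the example, and seat assignment and payment are left out.

```ruby
# Hypothetical sketch: everyone who joined marks the shows they can attend.
# `availability` maps a person to the list of shows they picked.

# For every show, list who can make it.
def tally(availability)
  counts = Hash.new { |hash, show| hash[show] = [] }
  availability.each do |person, shows|
    shows.each { |show| counts[show] << person }
  end
  counts
end

# The show that suits the most people (ties are not handled specially).
def most_popular_show(availability)
  show, _people = tally(availability).max_by { |_show, people| people.size }
  show
end

# Once the organizer picks a show, split the group into those who are going
# and those who get a refund (or the option to rearrange).
def settle(availability, chosen_show)
  going, refunded = availability.keys.partition do |person|
    availability[person].include?(chosen_show)
  end
  { going: going, refunded: refunded }
end

availability = {
  "me"  => [:fri_8pm, :sat_6pm],
  "ana" => [:sat_6pm],
  "ben" => [:fri_8pm, :sat_6pm],
  "cat" => [:sun_4pm]
}
chosen = most_popular_show(availability) # the organizer could still override this
result = settle(availability, chosen)
```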

For the cinema it’s a revenue booster. It makes it easier for people to commit to going to the cinema. And even people that don’t manage to go one day are compelled to go another because they already paid.

Build it with nice Facebook and Twitter integration and that’s it, you’ll be the most popular cinema in town.

My idea got validated

February 14, 2011

Personal

Business, idea

Some time ago I had an idea for a web application. That idea was essentially Gist. I couldn’t convince people to go after it, thankfully, because I wouldn’t like to compete with a Brad Feld-backed company. It’s nice to see that the idea was good, as Gist got bought by RIM. Congratulations Gist!

Obsolete email addresses (a feature request)

November 13, 2010

Personal

address, client, email, feature, historic, idea, mail, obsolete, request

This is a feature I wish the programs I use to read email had. Sometimes people change email addresses. It happens, to some more than to others. When that happens I don’t change the email address for that person in my contact list. I add the new address.

The reason is that I still want to maintain an association between all those emails I’ve sent to and received from that person and the contact details for that person. The idea is that when I ask my software for all emails from “John Smith”, even if John Smith changed addresses 15 times, it should still be able to find the old ones.

The problem is that sooner or later I send an email to that person using the obsolete email address. I really wish the software would allow me to mark addresses as obsolete or historic so that the information is not lost but I never use them again.
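
In data terms the request is small: keep every address, but flag the old ones. A minimal sketch, where the `Contact` class, its method names and the example addresses are all hypothetical:

```ruby
# Hypothetical sketch of a contact whose old addresses are kept for
# searching but never offered when composing a new message.
EmailAddress = Struct.new(:address, :obsolete)

class Contact
  attr_reader :name

  def initialize(name)
    @name = name
    @addresses = []
  end

  def add_address(address)
    @addresses << EmailAddress.new(address, false)
  end

  # Mark an address as historic without deleting it.
  def mark_obsolete(address)
    entry = @addresses.find { |a| a.address == address }
    entry.obsolete = true if entry
  end

  # Every address, old and new: used to find all mail from this person.
  def all_addresses
    @addresses.map(&:address)
  end

  # Only current addresses: used for autocompletion when sending.
  def sendable_addresses
    @addresses.reject(&:obsolete).map(&:address)
  end
end

john = Contact.new("John Smith")
john.add_address("john@old.example")
john.add_address("john@new.example")
john.mark_obsolete("john@old.example")
```

Searching would go through `all_addresses`, while the compose window would only ever see `sendable_addresses`.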

Rails comes with some awesome assertion methods for writing tests:

assert_difference("User.count", +1) do
  create_a_user
end

That asserts that the count of users was incremented by one. The plus sign is not needed, that’s just an integer; I add it to make things clear. You can combine several of these expressions into one assert_difference:

assert_difference(["User.count", "Profile.count"], +1) do
  create_a_user
end

That works as expected: it asserts that both users and profiles were incremented by one. The problem I have is that I often find myself doing this:

assert_difference "User.count", +1 do
  assert_difference "Admin.count", 0 do
    assert_difference "Message.count", +3 do  # We send three welcome messages to each user, like Gmail.
      create_a_user
    end
  end
end

That looks ugly. Let’s try something different:

assert_difference("User.count" => +1, "Admin.count" => 0, "Message.count" => +3) do
  create_a_user
end

Well, that looks nicer and more straightforward, so I implemented it (starting from Rails 3’s assert_difference):

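def assert_difference(expressions, difference = 1, message = nil, &block)
  b = block.send(:binding)

  # Normalize the old-style arguments (one expression or an array of them,
  # all sharing the same difference) into an { expression => difference } hash.
  unless expressions.is_a?(Hash)
    exps = Array.wrap(expressions)
    expressions = {}
    exps.each { |e| expressions[e] = difference }
  end

  # Evaluate every expression before running the block.
  before = {}
  expressions.each { |exp, _| before[exp] = eval(exp, b) }

  yield

  # Evaluate them again and compare against the expected differences.
  expressions.each do |exp, diff|
    error = "#{exp.inspect} didn't change by #{diff}"
    error = "#{message}.\n#{error}" if message
    assert_equal(before[exp] + diff, eval(exp, b), error)
  end
end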

Do you like it? If you do, let me know and I might turn this into a patch for Rails 3 (and then let them know, otherwise they’ll ignore it).

Update: this is now a gem.

An intelligent music player

December 9, 2009

Personal

Amarok, idea, intelligent, iTunes, machine learning, music, Ogg, player

I still haven’t found a good music player, for my computer that is. The one that got the closest to it was Amarok, but it was still very far away. My problem is that I don’t know what to listen to, really! I’m only just finding out what music to use for coding. There’s one thing I really want from a music player: for it to learn what to play for me. It’s not the same as learning what I like. It’s much more complex. Amarok learns what I like, but not really what to play for me.

In Amarok, when you jump to the next song it checks how much of the song you listened to and assigns a score based on that. Songs that you listen to completely get a high score and songs you listen to for only a couple of seconds get a low score. Over time, as you listen, those you like most and listen to most will get high scores while those you despise and skip immediately will get lower scores.

Amarok has a special playlist, or used to have in the 1.4 version, which is called “dynamic” and plays the songs with the highest scores. That sounds excellent, but it’s not enough. The music player I’d like to have would not compute how much I like a song, as Amarok does, but how probable it is that I’ll like it at the moment it plays that song.

Let’s call this player Pamup, Pablo’s Music Player, and let’s see how it could perform such a magic feat as playing songs that you want to listen to (even if you don’t know you want to listen to them).

Pamup would have a scoring for the songs, but instead of being a linear score it’ll be multidimensional. Let’s start with two simple dimensions and the rest will be clear: percent of playing time and time of the day. Say you play song A 100% of the way through and song B 50%. That means that you like song A better than B. That is what Amarok does. Pamup would instead record:
- Song A in the morning: 100%
- Song B in the morning: 50%
- Song A in the evening: 50%
- Song B in the evening: 100%
You like A as much as B, but you are more likely to want to listen to A in the morning, and B in the evening. Of course adding the time of the day will probably not improve the equation by much. The idea would be to add as many dimensions as possible. Some dimensions may be irrelevant and they should cancel themselves out, like in this case:
- Song A in the morning: 100%
- Song B in the morning: 50%
- Song A in the evening: 100%
- Song B in the evening: 50%
In that case, you like A better than B, in the evening and in the morning. The time of the day is irrelevant. Maybe it’s only irrelevant for some songs but not for others:
- Let it be, I like it at all times.
- O Fortuna of Carmina Burana, please, don’t wake me up with that (or maybe yes, please do, not sure).
Maybe it’s irrelevant for some people, but not for others. I don’t know and we don’t need to know.
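
As a toy illustration of the two-dimension case, a plain lookup table is enough. The `ListeningHistory` class and its method names are hypothetical, and this is only the bookkeeping, not a learning system:

```ruby
# Hypothetical sketch: remember, per song and per context, which fraction of
# the song was played, and use the average to guess what to play next.
class ListeningHistory
  def initialize
    # { [song, context] => [fractions played] }
    @plays = Hash.new { |hash, key| hash[key] = [] }
  end

  # fraction goes from 0.0 (skipped at once) to 1.0 (played to the end).
  def record(song, context, fraction)
    @plays[[song, context]] << fraction
  end

  # Average fraction played in this context, or nil if never played there.
  def score(song, context)
    fractions = @plays.fetch([song, context], [])
    return nil if fractions.empty?

    fractions.sum.to_f / fractions.size
  end

  # The song with the highest score among those already played in this context.
  def best_for(context)
    songs = @plays.keys.select { |(_song, ctx)| ctx == context }.map(&:first).uniq
    songs.max_by { |song| score(song, context) }
  end
end

history = ListeningHistory.new
history.record("Song A", :morning, 1.0)
history.record("Song B", :morning, 0.5)
history.record("Song A", :evening, 0.5)
history.record("Song B", :evening, 1.0)
history.best_for(:morning) # => "Song A"
history.best_for(:evening) # => "Song B"
```

Here the context is a single symbol; with more dimensions it could become an array or a hash of signals, at which point a table like this stops scaling and something smarter is needed.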

I can think of many other dimensions to add to the system, I’m sure many other people will think of more, and as technology improves we’ll be able to have even more:
- What program are you using? I want music that helps me concentrate when I’m using my text editor to write code, while I don’t care much about what I’m listening to while web browsing.
- What are you browsing? Maybe I do care about the music while I’m web browsing. Redditing and Facebooking can be done with pretty much any music, but if I’m at Lambda the Ultimate, I need something to concentrate. Even some analysis of the web site could give some important hints: lots of dense text, no pictures, play Mozart; a photo blog, play whatever.
- How are you controlling the player? Are you using the keyboard with global shortcuts? You are probably doing something else. Are you using the remote control? You are probably away from the computer. Are you using the mouse directly in the player’s window with the lyrics window open? OK, let’s play something with lyrics because you probably feel like reading, maybe even singing.
- Are you singing? We can find that out using the computer’s microphone. Let’s play things that are in your vocal range, and mostly by the same gender as you are. Let’s also play things you liked singing before.
- Are you using only one app or switching between various apps?
- Which apps are you switching with?
- Is there any other sound coming out from the computer? If so, maybe soothing background music with not much volume is what the player should play.
- Are you dancing? Let’s disco! You think that’s a tough one? Most smart phones have accelerometers in them; if you have the smart phone in your pocket I’m sure I can find out if you are on the couch or dancing, or maybe moving but not dancing. Even the raw input of the accelerometer could be used as a signal, because it’ll be different depending on how tired you are and how you are dancing.
- Are you alone? You think that’s a hard one as well? Many people are using wifi, so, what’s the signal strength received on other devices on the same network? If another computer has a similar signal level to yours and it is being used, you are probably not alone. It could also be done using smart phones, although a smart phone doesn’t need to be in use, it just needs not to be lying on the table. If it’s plugged into the computer, you can ignore it; if it’s flat and not moving (accelerometer again), you can ignore it.
- Who are you with? I hope by now you realize how much we can find out. Let’s make it social, let’s have the app on every device. Why would people install it? Well, when you visit me, if you have it on your device, your device will tell my computer what you like, and my computer knows what I like, so it’ll try to find a common ground for us (and it won’t trust me that much when I skip a song, because maybe it’s you skipping it). We could make you use your own smart phone to skip it, and then Pamup knows who is skipping it.
- Who are you talking with? If you are talking with other people, using voice recognition you may identify those people, or at least how many there are. If there’s cutlery clatter in the background, people are eating, let’s just play background music for a nice evening. If it’s only you speaking, maybe you are on an old land-line phone (if you were using your smart phone, Pamup would know), let’s cut the music altogether, probably it’s distracting.
I believe this program should not work with special cases but have some very sophisticated machine learning system where we input all these signals and it does the right thing. And as more signals become available, they are added and analyzed as well. I would like to have that music program! Because honestly, really, I’m not sure what music I want to listen to. I want my computer to figure it out for me.

Fiction blogging

November 25, 2009

Personal

blog, blogging, entretainment, fun, idea

As stories can be told in first person or third person, in the form of a diary or a tale, as a book or comic or movie, I was thinking that blogging could be a literary style as well.

I can think of two sub-genres: historic and fantastic blogging.

For historic, imagine a blog written in the context of -70 (minus 70) years, so that on October 19th 2009 you’d get a post for October 19th 1939. Who would be the blogger? It could be an important person: what would Churchill blog? Or it could be an unnamed person, an anti-Nazi Frenchman for example. They could also have a Twitter account! It would be an interesting way to learn history.

The other genre would be total fantasy. A blogger in the future: imagine if, for some strange reason, the blog posts of a guy surviving the singularity traveled back in time. What about blog posts from a galaxy far away? I would certainly follow those blogs! But of course, it’s hard work that requires a very good writer.

Another thing that could be applied to any fictional blogging is having a network of blogs. Imagine reading the blogs of a Frenchman in the Resistance, a Nazi soldier, a Russian Red Army member. All blogging about the same events, from different perspectives! What about reading the Twitter feeds of Frodo, Sam, Gandalf, Aragorn, Saruman?

I think it would be very entertaining.

The tracker for movies

November 18, 2009

Personal

anobii, books, Business, film, idea, last.fm, movie, movies, music

For showing what music I like, keeping track of what music I listen to, discovering new music and finding people with the same tastes I use last.fm. For doing that but with books I use aNobii. Is there anything like that for movies? If not, there’s a market.

A feature we need for the post-PC era

November 11, 2009

Personal

idea, mobile, post-PC, processing, sharing

The post-PC era is when we stop having PCs because we move to something else. You may think that’s unlikely and unrealistic but look at the evidence. At one time we had desktop computers, and laptops started to appear. At first they were just toys for people with lots of money, then they became the second computer of people who spent a lot of time on the go, and today most people own a laptop instead of a desktop computer.

The exodus from the PC is not going to be that easy, because mobile devices are more different from a laptop than laptops were from desktop computers. But it’s not only leaving PCs for smart phones, it’s also leaving them for netbooks. I believe it’s going to happen. Probably not as extreme as the PC to laptop move, but it’s going to happen. We’ll be using our phones as our primary way to access data and communicate. And when we come home we’ll plug them into a docking station -that already exists-, so that they can use our nice big speakers -those also exist- and so that we can use an external keyboard -that exists in many cases- and a big external screen -that exists, at least in the netbook market-.

What doesn’t exist yet, I believe, is external processing. When I’m at a bar I won’t play Halo and I’m OK if switching between applications is slow, but when I come home I want my device to become faster. I have seen absolutely no progress at all in being able to add processing power to a machine, to a portable machine. The closest I’ve seen were docking stations, probably from IBM, which added better sound and video cards. Sharing processing power is hard, but I think we need it to go mobile.

Reviewed by Daniel Magliola. Thank you!

Deferred Twitter posting

September 22, 2009

Personal

Business, idea, twitter

Here’s an idea for those Twitter clients out there, web and desktop: deferred posting.

One tweet per hour during eight hours is much more effective than 8 tweets in a row. But sometimes you want to write eight tweets in a row and I find two reasons to do that.

You are using Twitter professionally, for your work, as a marketing and social tool. You want to minimize the hit it takes on your productivity so you limit yourself to 15 minutes of tweeting per day. In those 15 minutes you generate tweets for the whole day, you want them to be automatically distributed through the day.

When you open Twitter after some hours of not using it, like after sleeping, you’ll find yourself replying to lots of stuff as you go through it. That’s especially true if you are 8 time zones or so away from most people you follow.

I think a Twitter client should do the distribution automatically. It could distribute them evenly through the day, depending on how many you have on your queue. Whenever you want to tweet you just add it to the queue.

Why limit it to one day? Why not leave tweets for tomorrow? And if not one day, how long? A way to solve the problem is to try to keep your speed constant, minimizing acceleration and deceleration.

For example, if you normally tweet 5 times a day and you have 10 tweets in your queue, do 7 today and leave 3 for tomorrow, so that you don’t double the speed, you just increase it a little bit. If tomorrow you add another 10, you’ll have 13 and you are at a speed of 5.2 (previously you were at 5, but yesterday with 7 you sped up a little). So today you get 9 published and 4 left for tomorrow, and so on.
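
One possible rule that reproduces those numbers is sketched below. Both tuning values are assumptions picked to match the example: publish halfway between your usual speed and the queue size (rounded down), and let each day move your usual speed by one tenth of the difference. The method names are made up, and `Rational` is used only to keep the arithmetic exact.

```ruby
# Hypothetical sketch of the pacing example. `catch_up` and `smoothing` are
# assumed values, chosen because they reproduce the numbers in the text.

# How many queued tweets to publish today.
def tweets_for_today(speed, queued, catch_up: Rational(1, 2))
  return queued if queued <= speed # nothing to catch up on: publish them all

  (speed + (queued - speed) * catch_up).floor
end

# Usual speed after publishing `published` tweets today (a moving average).
def updated_speed(speed, published, smoothing: Rational(1, 10))
  speed + (published - speed) * smoothing
end

speed = Rational(5)
today = tweets_for_today(speed, 10)           # => 7, leaving 3 for tomorrow
speed = updated_speed(speed, today)           # => (26/5), that is 5.2
tomorrow = tweets_for_today(speed, 3 + 10)    # => 9, leaving 4
```

A larger `catch_up` empties the queue faster at the price of bigger jumps in speed; a smaller one keeps the speed steadier but lets the queue grow.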

You’ll have different speeds on weekends and business hours. There’s a curve of speed and the Twitter client should try to match it with what you have on the queue.

If you want to tweet directly, skipping the queue, you can do that just fine.

Another interesting way is to match the curves of your readers instead of your own. The Twitter client would measure when your readers are posting more and, presumably, also reading more. It’ll make an average and it’ll have the curve of speed of your network. Instead of posting following your previous curve, it’ll post following your network’s curve, maximizing the number of people that are likely to read your tweet.
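
Following a curve could be as simple as handing out the day’s tweets in proportion to the activity measured in each hour. A rough sketch, assuming the client already has an activity count per hour (the numbers below are invented) and using largest-remainder rounding so the total always matches:

```ruby
# Hypothetical sketch: spread the day's tweets over the hours in proportion
# to how active your readers are in each hour.
def schedule(tweet_count, activity_by_hour)
  total = activity_by_hour.values.sum.to_f
  raise ArgumentError, "no reader activity recorded" if total.zero?

  # Ideal (fractional) number of tweets per hour.
  ideal = activity_by_hour.transform_values { |activity| tweet_count * activity / total }
  plan = ideal.transform_values(&:floor)

  # Hand out the tweets lost to rounding, largest remainder first.
  leftover = tweet_count - plan.values.sum
  by_remainder = ideal.keys.sort_by { |hour| -(ideal[hour] - plan[hour]) }
  by_remainder.first(leftover).each { |hour| plan[hour] += 1 }
  plan
end

# Made-up activity counts for four hours of the day.
plan = schedule(8, { 9 => 10, 13 => 30, 18 => 40, 22 => 20 })
# => { 9 => 1, 13 => 2, 18 => 3, 22 => 2 }
```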

I would call that, Professional Tweeting.

Another interesting feature would be to assign an importance to your tweets. More important tweets are sent when the chances of getting them read are highest, when the curve reaches its peak.

Reviewed by Daniel Magliola. Thank you! Twitter carved-wood icon by gesamtbild.