Who's talking to my cloud?

This is the second time we've gotten into this situation. It always begins the same way. Someone says: "We'll do authentication after proving the design".

Then a few months into the project, it's time to implement authentication. It takes a few weeks to sync all services and clients.

Are we all good then?

Many months later, the monitoring still looks like this...

You'd think 4xx status codes are client-side request problems, so why are we monitoring them? The answer is that this is still pre-release. They would be pretty valuable for finding client-side errors and reporting them to the correct team, if the monitoring weren't completely flooded...

Those failed requests are coming from 5-6 clients running on tester VMs or IoT devices, and no one on the cloud services team can locate them... An intern who left an application running on some shared infrastructure and went back to school? A tester who forgot about that one-off test he ran on one of his physical machines?

The request IP is the corporate public address or that of a data center, and there is no identifying information in the payload, as good privacy practices rightly require. And without authentication, there is no way to trace it back to anyone...

Every now and then, one of those is found and shut down, so the number of failed requests dips a little, as we see in this graph. It's the little victories...

In this case, it's much worse because the old proof-of-concept client application had a bug: its backoff mechanism didn't work, so it just retried those errors in a tight loop...
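For contrast, here is a minimal sketch of what a working retry loop could look like. It is not the original client's code: the function names, the default delays and the choice of which status codes count as retryable are all assumptions made for illustration. The idea is that a 4xx answer means the request itself is wrong, so repeating it can't help, while a 5xx or 429 is retried with an exponentially growing, randomized delay and a hard cap on attempts.

```python
import random
import time
from typing import Callable


def is_retryable(status: int) -> bool:
    # Assumption for this sketch: only 429 and 5xx are worth retrying.
    # Any other 4xx means the request is wrong and will fail again.
    return status == 429 or 500 <= status <= 599


def send_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical callable that performs one request and returns
    the HTTP status code. sleep is injectable so the loop can be tested
    without actually waiting.
    """
    status = send()
    for attempt in range(1, max_attempts):
        if not is_retryable(status):
            break
        # Exponential backoff with full jitter: wait a random time between
        # 0 and min(max_delay, base_delay * 2^(attempt - 1)).
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, ceiling))
        status = send()
    return status
```

With a loop like this, a forgotten client that keeps getting a 400 sends one request per run instead of hammering the service in a tight loop.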

Lessons learned

When building an API, before even the first client connects...

Make sure you can identify your clients even if authentication isn't a requirement.
Add the client application version to a request header. It will make it so much easier to retire deprecated functionality.
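The two lessons above can be sketched in a few lines. This is only an illustration under stated assumptions: the header names X-Client-Id and X-Client-Version are hypothetical, the client ID is assumed to be an opaque value handed out when a client is registered (so it maps to an owner without putting personal data in the request), and versions are assumed to be dotted numbers like 1.2.0.

```python
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical header names, chosen for this sketch only.
CLIENT_ID_HEADER = "X-Client-Id"
CLIENT_VERSION_HEADER = "X-Client-Version"


def client_headers(client_id: str, version: str) -> Dict[str, str]:
    """Client side: headers to attach to every request."""
    return {CLIENT_ID_HEADER: client_id, CLIENT_VERSION_HEADER: version}


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Turn '1.10.0' into (1, 10, 0); return None if it is not dotted numbers."""
    try:
        return tuple(int(part) for part in text.split("."))
    except ValueError:
        return None


def classify_request(headers: Mapping[str, str], minimum_version: str) -> str:
    """Server side: label a request as 'ok', 'deprecated' or 'unidentified'."""
    minimum = parse_version(minimum_version)
    if minimum is None:
        raise ValueError(f"invalid minimum version: {minimum_version!r}")
    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    client_id = lowered.get(CLIENT_ID_HEADER.lower())
    version = parse_version(lowered.get(CLIENT_VERSION_HEADER.lower(), ""))
    if not client_id or version is None:
        return "unidentified"
    # Tuples compare number by number, so 1.2.0 sorts before 1.10.0.
    if version < minimum:
        return "deprecated"
    return "ok"
```

Logging that label next to each failed request would let the monitoring separate a known, outdated client from a truly anonymous one, and the client ID gives someone to call.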

In any case, next time someone says "We'll do authentication after proving the design", I'll have a war story to link them to!