As an SRE, what do I do about Alerts caused almost entirely by poor customer communication or misuse of a product?

th3raid0r@tucson.social · 8 months ago

Huh, now that’s a classic I never thought would get a remaster/re-release! I played this a ton when I was a little kid in the 90s on my Sega Genesis.

Though I’ll probably stick to purchasing on Steam. I’m steering clear of Nintendo where possible.

th3raid0r@tucson.social · 1 year ago

It is definitely an under-provisioning problem, but that under-provisioning is usually caused by customers being very stingy about what they are willing to spend. Also, to be clear, it isn’t buckling. It is doing exactly the thing it was designed to do, which is to stop writes to the DB since there is no disk space left. Before it gets to that point, it constantly throws warnings to the end user. These customers tend to ignore those warnings until they reach the stop-writes state.

In fact, we just had to give an RCA to the c-suite detailing why we had not scaled a customer when we should have, but we have a paper trail of them refusing the pricing and refusing to engage.

We get the same errors, and we usually reach out via email to each of these customers to help project where their data is going and scale appropriately. More frequently, though, they are adding data at such a fast clip that going 2 hours without a response from them would lead directly into the stop-writes state.
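To make the "project where their data is going" step concrete, here is a minimal sketch of a straight-line projection from two disk-usage readings. Everything in it is an assumption for illustration: the `UsageSample` type, the `hours_until_stop_writes` function, and the 90% stop-writes threshold are hypothetical names and values, not anything from our actual product.

```python
from dataclasses import dataclass


@dataclass
class UsageSample:
    timestamp_s: float  # seconds since an arbitrary reference point
    used_bytes: float   # disk bytes in use at that time


def hours_until_stop_writes(samples, capacity_bytes, stop_writes_fraction=0.9):
    """Linearly extrapolate disk growth from the first and last sample.

    stop_writes_fraction is an assumed threshold, not a real product default.
    Returns 0.0 if usage is already past the threshold, and None if usage
    is flat or shrinking (no projected stop-writes).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    first, last = samples[0], samples[-1]
    elapsed_s = last.timestamp_s - first.timestamp_s
    if elapsed_s <= 0:
        raise ValueError("samples must be in increasing time order")

    limit_bytes = capacity_bytes * stop_writes_fraction
    if last.used_bytes >= limit_bytes:
        return 0.0

    growth_per_s = (last.used_bytes - first.used_bytes) / elapsed_s
    if growth_per_s <= 0:
        return None
    return (limit_bytes - last.used_bytes) / growth_per_s / 3600.0
```

If the result is smaller than the time a customer usually takes to answer an email, that is the case where outreach alone cannot prevent the stop-writes state.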

This has led us to guess where our customers are going to end up, oftentimes being completely wrong and having to scale multiple times.

Workload spikes are the entire reason why our database technology exists. That’s the main thing we market ourselves as being able to handle (provided you gave the DB enough disk and the workload isn’t sustained for long enough to fill the disks).

There is definitely an automation problem. Unfortunately, this particular line of our managed services will not be able to be automated. We work with special customers with special requirements, usually Fortune 100 companies that have extensive change control processes, custom security implementations, and sometimes no access to their environment at all unless they flip a switch.

To me it just seems to all go back to management/c-suite trying to sell a fantasy version of our product and setting us up for failure.

th3raid0r@tucson.social · 1 year ago

Probably not feasible in our case. We sell our DB tech based on the sheer IOPS it’s capable of. It already alerts the user if the write-cache is full or the replication cache is backing up too.

The problem is, at full tilt, a 9 node cluster can take on over 1GB/s in new data. This is fine if the customer is writing over old records and doesn’t require any new space. It’s just that it’s more common that Mr. Customer added a new microservice and didn’t think through how much data it requires, causing a rapid increase in DB disk space or IOPS that the cluster wasn’t sized for.
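For a sense of scale, here is a back-of-the-envelope sketch of how fast free space disappears at that rate. Only the 9 nodes and the 1GB/s come from the numbers above. The 1 TB of free disk per node and the replication factor of 2 are made-up values for illustration, and the function name `seconds_to_fill` is hypothetical. It also assumes the 1GB/s is the ingest rate before replication.

```python
def seconds_to_fill(free_bytes_per_node, nodes, ingest_bytes_per_s, replication_factor=1):
    """Seconds until the cluster's free disk is consumed.

    Assumes data spreads evenly across nodes and every ingested byte
    is stored replication_factor times.
    """
    total_free_bytes = free_bytes_per_node * nodes
    return total_free_bytes / (ingest_bytes_per_s * replication_factor)


# 9 nodes, an assumed 1 TB free on each, 1 GB/s of new data, assumed 2 copies.
seconds = seconds_to_fill(10**12, 9, 10**9, replication_factor=2)
print(seconds / 60)  # 75.0 minutes
```

Under those assumptions a cluster with 9 TB of headroom fills in 75 minutes of sustained new writes, which is why a response that arrives hours later is already too late.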

We do have another product line in the works (we call it DBaaS) and that can autoscale because it’s based on clearly defined service levels and cluster specifications. I don’t think that product will have this problem.

It’s just that these super mega special (read: big, important, Fortune 100) companies have requirements that mean they need something more hand-crafted. Otherwise we’d have automated the toil by now.

th3raid0r@tucson.social · 1 year ago

As an SRE, what do I do about Alerts caused almost entirely by poor customer communication or misuse of a product?

th3raid0r@tucson.social · 2 years ago

I dunno, Mr. Google, but I’m fairly sure Azure won’t decide to sell off their domain registrar out from underneath their customers.

I’m fairly certain that Azure won’t drastically update the “packages” to buy every 6 months like GCP/Gcloud did.

I’m fairly certain, given the track record of Google products and services, that this has nothing to do with Azure being “anti-competitive” and everything to do with Google being known for axing its own products. If I build something on GCP, I can’t trust that it will continue to run unattended. I know I’ll always need to keep my eyes on the news feed in case Google axes another product I was using.

th3raid0r@tucson.social · edit-2 2 years ago

Agreed, and my one call-to-action post to get other admins to give a crap fell on its face over on beehaw. It seems that many admins really think that every instance should use manual registration or other tools. All in all, the message I got was “The devs don’t have to listen to anyone”.

I’m now of the opinion that most lemmy admins aren’t people I want to associate with. They seem to be all about “open source” until it collides with concepts like “collective responsibility”, and then you get a response in the individualist line of reasoning: “Oh, just fix it yourself”.

Kbin is sure lookin’ pretty good these days now.

th3raid0r@tucson.social · 2 years ago

Admin of tucson.social here - I haven’t noticed an attack on my instance yet but I do have Captcha AND Email validation turned on.

Since my instance is for Arizonans only, I could do a geo-IP block if pressed, but obviously that won’t work for places like startrek.website.

If any admin needs assistance, I recommend enlisting some help over at programming.dev - likely the best instance for collaborating on our lemmy servers.

th3raid0r@tucson.social · edit-2 2 years ago

In lemmy’s case, my perusal of the DB didn’t really suggest that the queries would be that complex and I suspect that moving it to a higher performance NoSQL DB might be possible, but I’d have to take a look at a few more queries to be sure.

I wonder if this could be made to work with Aerospike Community Edition…

Obviously it could be more effort than it’s worth though.
