I'll provide a scenario for how the above potential feature is envisioned.
Let's say a node goes down. A pop-up appears saying "Hey, this might be facing issues. Want to deploy a backup service?"
You click yes, and it generates a new service with the same specifications in a nearby location (or the same location, if still available). If it succeeds, it loads in and gets marked as your "secondary" service. Your service details page marks the "secondary" service as the "active" one, and shows you a notice that your other service has been moved into a passive state and will be terminated automatically in 7 days. You can visit it in the second tab.
The "secondary" service, when marked active, will also have a "load in backups" option: it checks for a backup and loads it in. If that fails, it tells you to re-install an operating system instead. Since the "active" service is the secondary, you get full controls on it.
The decision is now yours on how you want to proceed. By default, the "secondary" service becomes the permanent new service; after 7 days, we assume we're doing a terrible job at restoring the original and it goes away from your view. Otherwise, if the original comes back within those 7 days, you can instead choose to go back to it: visit the tab, switch it to active, and confirm that the "secondary" will go into a passive state and be deleted when the timer runs out instead. You can technically switch back and forth as much as you want while both are available during the 7 days, but each switch powers down the service entering the inactive state. We may add a button for a one-time temporary power-on of the inactive service when available, for something like 8 hours, to let you transfer files between them.
This process also helps us set a deadline for fixing nodes. If a node isn't fixed after 7 days, it'll still appear in your "history" later on as a feature of the second tab, where you can contact support about it (we let you know beforehand in a pop-up if nothing can or will be done about it; in that case the request just gets marked as rejected, with no human response unless we can actually do something.)
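To make the flow above concrete, here's a rough sketch of the active/passive switching as a small state machine. All names and the Python structure are mine, purely illustrative; the real thing would live inside the WHMCS customizations:

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(days=7)  # passive service is terminated after this

class ServicePair:
    """Tracks an original service and its emergency 'secondary'."""

    def __init__(self, original_id, secondary_id, now=None):
        now = now or datetime.utcnow()
        self.active = secondary_id        # secondary starts as the active one
        self.passive = original_id        # original is passive, powered down
        self.deadline = now + GRACE_PERIOD

    def switch_active(self, now=None):
        """Swap which service is active; allowed any number of times
        while both still exist and the 7-day deadline hasn't passed."""
        now = now or datetime.utcnow()
        if now >= self.deadline:
            raise RuntimeError("passive service already terminated")
        self.active, self.passive = self.passive, self.active
        # the newly passive service powers down; the deadline is unchanged

    def expired(self, now):
        """After 7 days the passive service is terminated."""
        return now >= self.deadline

# example: customer deploys a secondary, then the original node recovers
pair = ServicePair("vm-original", "vm-secondary", now=datetime(2023, 8, 1))
first_active = pair.active                    # the secondary
pair.switch_active(now=datetime(2023, 8, 3))  # original came back, switch
second_active = pair.active                   # back on the original
```

One design note on this sketch: keeping the deadline fixed at creation time (rather than resetting on each switch) matches the "7 days and the passive one goes away" rule described above.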
@VirMach said: I'll provide a scenario for how the above potential feature is envisioned.
My personal opinion is that you are adding layers of complexity on to a system, SolusVM/WHMCS, that I expect will only cause additional issues, and even more support tickets. If you are trying to eliminate extended down times for customers it might be better to be proactive as opposed to reactive. Working on the root causes of these issues may be a better way to go about it. The frequency of reoccurring extended down time issues at VirMach is not a normal thing in the industry.
When a node has an unsolvable hardware issue like LAX2Z019 or DFWZ007 just telling people that their VM is going to be recreated somewhere else is the right answer IMO. I don't like being told that I am going to have to spend a few hours rebuilding something, but it beats waiting 1-6 weeks and then being told I need to rebuild it anyway.
IMO simplifying your operating methods, and working with reliable, responsive partners is the way to solve these issues.
EDIT: Of course since I am looking at things from just a customer's perspective I may not understand all the issues involved.
Shorter version of my reply: if the issue is that we have a bad time during emergencies trying to load in new services, do you think it could be worse with the self-serve optional system? Or, more specifically, what do you envision going wrong that would be worse than the current alternative of just being down? I have some answers myself, as I've thought it out; I'm wondering what yours are.
As for unsolvable issues, where the right answer is to recreate the service somewhere else: I'll present the issue that comes with this, which is customers who value their data more than the service being up, and who just get angry if a new service goes up with no data. There's a good number of these, where they misunderstand the situation; and then on top of that, moving more people is more work, especially while also dealing with the other work involved.
@FrankZ said: you are trying to eliminate extended down times for customers it might be better to be proactive as opposed to reactive. Working on the root causes of these issues may be a better way to go about it.
I feel like we've nailed this part, even if it doesn't seem like it due to everything else going on. The latest fix looks great on every node except the ones with a specific board, of which we have maybe 3 or 4 right now; most of those have already been retired. So we've already handled most, if not all, of the "proactive" side, and this would be the "reactive" portion. The balance isn't 100% to 0% proactive to reactive; there still need to be some changes made on the reactive side.
But yes generally agreed, I have the same philosophy.
We definitely wouldn't do this first and then try to fix the problem. We only do that in emergency scenarios. For example, the VLAN splitting wasn't fixing the actual issue, it was just a Band-Aid; we do those too when necessary, but this isn't one of them.
Sabey/Seattle update I just got: as usual their guy is really nice and just trying to help out, and the delays were basically on Unitas's side, plus him and myself being busy and trying to coordinate everything on each end. Anyway, it looks like they can just ship out our stuff. It's not 100% yet, but here's the game plan since we're adding that into the mix:
We still try to load in everyone to LAX
Simultaneously we coordinate sending equipment to San Jose
Whichever gets people up faster, we go with that. We send people an email in any case just to confirm what happened and what else they can do. (If they end up going up in San Jose, good; they stay there for now, and we'd know which nodes to offer migration back to Seattle later, optionally, if we keep Seattle. If they end up going to LAX from backups, we make a list of anyone who wants to be loaded in from San Jose to LAX, or LAX to San Jose, or whatever, in a neat way.)
As for going back up in Seattle itself: basically a no-go. It's probably going to take 2 weeks after all that to get transit. It could be expedited for a high fee, but that's not guaranteed, and it'd be difficult to do the right blend/IX right now in any case. It seems like we need to just do a relaunch for that location. Even expedited would likely still run into the middle of next week, and that's if everything goes perfectly.
Going back to trying to make improvements/adding more clunky features: we usually focus on how it goes wrong and then take what went right as the "default."
If we went back to stock WHMCS/SolusVM (I'm just trying to envision it right now), even with all the extra issues our customizations may have caused at one time, I think the general vibe is that they were overall positive and saved us time. I actually cannot imagine being able to run things as efficiently as we do (whether that's 5% efficient, 50%, or whatever) without all our weird scripts.
@VirMach said: Shorter version of my reply: if the issue is that we have a bad time during emergencies trying to load in new services, do you think it could be worse with the self-serve optional system? Or, more specifically, what do you envision going wrong that would be worse than the current alternative of just being down? I have some answers myself, as I've thought it out; I'm wondering what yours are.
My concerns here would be:
1. SolusVM or WHMCS makes a change that breaks this option after you implement it.
2. That customers will misinterpret what is going to happen when they activate this option, because many people seem to have a problem with reading comprehension and it is not something they are familiar with from other providers.
3. That both 1 & 2 will create tickets that require longer back and forth responses than when it is just down.
@VirMach said: As for unsolvable issues, where the right answer is to recreate the service somewhere else: I'll present the issue that comes with this, which is customers who value their data more than the service being up, and who just get angry if a new service goes up with no data. There's a good number of these, where they misunderstand the situation; and then on top of that, moving more people is more work, especially while also dealing with the other work involved.
I can understand clients who are paying you big bucks having these expectations, but not the lower-end market. My understanding is that you are using RAID 0 because of various reasons relative to the drives and the Ryzen boards you had available at the time of build. In this system, like most hardware systems, it is not a question of if it will fail, but when. If you want to be able to restore customer data from backups make a good backup system and when a node has been determined as having a catastrophic failure, restore backups to a new node. I don't understand why the customer needs to be involved in this choice. I understand that this can't be done in every location, because it would be difficult to justify a non-utilized spare node in places like Denver, but in NYC, Tokyo, or Los Angeles this might be a realistic option. It also helps that you already have storage servers in these locations, so spares could be used for both your and clients' backup servers.
I also understand that places like Tokyo or Frankfurt, where we used to hear about issues all the time, have been silent for quite a while now. I expect that is because you have resolved the hardware/firmware/kernel issues that were the cause, and because xTom is a good partner. I liked your OKC idea because I thought it might eliminate what I think has been your biggest issue: unreliable partner DCs. There is not much you can do about things on the other side of the country if your DC partner will not pick up the phone or answer your ticket. So IMO this is a big part of proactively solving the longer-term issues.
I saw it as a positive that you are moving Seattle to Los Angeles. I have a couple of VMs there and can understand why others may not feel so positive about the move, but if you see that Sabey can't work in a reasonable amount of time, it is better to get out than to continue adding complications.
I see that you have already answered some of the issues I spoke of above while I have been writing this. Please understand that I really do not have any concept of what you deal with on a day-to-day basis, and these comments are not meant as criticism. You should do what you think is best for your company, because you are the one who will have to manage the outcome of those decisions. It is easy for me to play armchair quarterback while not having to deal with all the real day-to-day issues you face.
@VirMach said: If we went back to stock WHMCS/SolusVM (I'm just trying to envision it right now), even with all the extra issues our customizations may have caused at one time, I think the general vibe is that they were overall positive and saved us time. I actually cannot imagine being able to run things as efficiently as we do (whether that's 5% efficient, 50%, or whatever) without all our weird scripts.
I like what you have done with the WHMCS changes so far, and if it makes you more efficient all the better.
@FrankZ said: My understanding is that you are using RAID 0 because of various reasons relative to the drives and the Ryzen boards you had available at the time of build.
We're not that crazy, it's not RAID 0, just no RAID for most of them. The ones where we did end up doing "hardware" RAID10 are the ones that had the biggest problems, as I suspected could happen, because now it's another set of drivers that works less well, especially when it comes to any updates. It also doesn't seem to work properly: if one drive drops off, it's the same issue as with no RAID, so there's basically no point in having RAID10 other than maybe, maybe allowing for recovery, and even that's a finicky process. If an NVMe drops off, whether in "hardware" RAID10 or no RAID, it's basically the same thing, just with even more headache in RAID.
SolusVM released a feature that we could use to handle all this a little better, but that would require more customization using the new API, which I decided against, ironically for the reasons you described. I'll think about that further for this other change. When it comes to something like disk, I automatically agree with you; for this other change, it's hard to gauge whether it could end up being worse.
I guess one way we could do it is to roll it out for a limited number of servers in beta for a few months, or rather just try it out the next time a downtime occurs and compare.
@FrankZ said: I don't understand why the customer needs to be involved in this choice.
I answered this in my long version which I didn't post, so I'll copy over the relevant portion.
Actually nevermind, it's so long and it's faster to just write it out again versus reading through to find it.
The reason is the same as one of the things you listed above: customer confusion. What we noticed when we did the recreations on a new node was that people would get confused, over various periods of time and in different ways. They'd not read anything, misunderstand, and end up angrier than if we had just left it offline. If it's offline, or they requested something themselves, they're aware of what's happening; otherwise they just assume offline means something's wrong. If it's online but now has no operating system, is in a different location, on a different server, with a different IPv4 address, those can all be seen as individual issues. For others it gets worse: they think their data is now 100% lost, so either they don't contact us to proceed with loading in any backups, or they go off on us as if we had already definitely lost their data, and it turns into the kind of ticket that takes a very long time to defuse. Once they're already mad, it doesn't matter if we explain what's actually happening; they're mad, and they get mad at every little part of the process and how it's done.

And if we have to do all that work manually anyway, for everyone, even those who don't want it, then by default it's more work, plus possible human error. If we automate it, it's going to do the same thing and run into the same problems, just in bulk and without any good refinements over time. One good thing about self-serve is that it basically addresses all this: if it breaks, the customer will also see it broke, and it will definitely (okay, probably not definitely, but as close as it gets) correctly be attributed by the customer as "I clicked this feature, and it broke." Versus "why did you lose my data / why is my IP different / why does it say no disk / where did my service go," plus all the other tickets about us just automatically redeploying everyone, even with communication.
@FrankZ said: If you want to be able to restore customer data from backups make a good backup system and when a node has been determined as having a catastrophic failure, restore backups to a new node.
Well, this would be that. If your issue is with it being presented to the customer at all, and you think we should just keep it on the backend, then that could probably be done. In a broad sense I see it this way: if we're trying to build a self-driving system for VirMach's private roads, we might as well apply that same idea to customer usage, since that helps refine it more quickly.
I guess I didn't really dive into a lot of the backend stuff, but it incorporates the actual backing-up part, organizing it so it can be done securely and neatly by the system. We already have our own scripts that are basically better than SolusVM's way of doing everything (they did finally "fix" some major issues with backups in the latest release, but who knows whether that means they didn't create a second problem.)
So let me do a quick comparison of where we're at. This is compared to how SolusVM did it when we last checked, not the most recent version. I didn't want to go into this part because it would end up being long, but here we go; now I'll also have a good overview of this system that I can save for documentation.
SolusVM:
Set up backup server information
If it disconnects at any time for any reason, the entire thing breaks, and it requires manually going in and deleting the file(s) that stop it.
No real notification system AFAIK for failures; we did have one at some point, but it was unreliable due to the way it had to check
No proper pruning; really terrible for loading backups back in, it would completely freeze and take forever to load them in one by one
No real controls on timing, how it manages memory usage, or anything advanced like that
Uses the same disk as the VMs, as far as I remember, to package them up; zips and stores the backup first, then ships it off after that's done
Seems to take a very random amount of time to move forward; doesn't take any resource levels into account
Our way so far (mainly improved/reworked/added features during the Dedipath moves, but very segmented):
Fires off a notification if it breaks (unless it breaks breaks, need to improve to be more granular.)
Manages memory properly
Uses the appropriate procedure to clone a VPS while it's powered on with minimal corruption; goes NVMe to NVMe first, but in a way where it doesn't need an arbitrary amount of space
Customized in a specific way to reduce risk of NVMe dropping or causing other problems, offset properly in terms of timing using a simple algorithm to reduce overlaps
Copies it over and simultaneously zips between NVMe and HDD, with a custom bs
Already has multiple checks; it doesn't just proceed blindly in all cases until it breaks and then do nothing. It makes sure the appropriate amount of memory is there, appropriate processing power, no load issues, that the hard drive is in the proper power state, and that space is available, and it rechecks some of these for every single VM it goes over instead of only at the beginning. If one backup breaks, the rest don't; it continues with them as long as the other checks pass
Properly deletes any intermediary files, even if it closes out; doesn't leave anything halfway. Properly deletes older backups at the right time, as best as possible
Makes sure the filesystem it's writing to is the right size, and even makes sure it's actually going to a spinny drive; if it can't locate the default, it checks a bunch of things and then creates it
Moves over the configuration files required
Keeps a 1:1 list of IDs to IP addresses, which later on lets you enter an IP address to load in a backup; that script also checks a bunch of things when loading it in, and you can also load in bulk. There are various other customized versions/scripts that interact with this setup, but we won't get into those, as this is strictly about the backup-taking portion.
Keeps track of folders for external backups (this isn't done on all nodes yet) and syncs the backups to an external source independently and granularly. It doesn't sync unless necessary, and only after we have the first backup; it doesn't try doing everything at once, which reduces networking-related errors, and it keeps going all the time, checking and syncing. This was used for NYC, which is ironic since that's the location we got back up first and didn't need any backups for, but it improved things by having backups in two places, double-checked basically, because if this failed it'd fire off another error
Properly puts the hard drive in power-saving mode after it's done
Oh, and a huge difference from SolusVM: it doesn't break itself permanently until human confirmation; it keeps retrying/going.
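As a sketch of two of the behaviors listed above, the per-VM rechecks where one failure doesn't kill the batch, and the timing offsets that spread backup runs out to reduce overlaps, something like this captures the idea. Every name, threshold, and structure here is my own illustration, not the actual script:

```python
import hashlib

def stagger_offset(node_id, window_minutes=60):
    """Deterministically spread nodes across a time window so their
    backup runs don't all start at once (reduces NVMe/IO pressure)."""
    h = int(hashlib.sha256(node_id.encode()).hexdigest(), 16)
    return h % window_minutes

def backup_all(vms, checks, backup_one):
    """Back up each VM, re-running the resource checks before every one.
    A single failure skips or marks that VM instead of aborting the batch."""
    results = {}
    for vm in vms:
        if not all(check() for check in checks):   # recheck every iteration
            results[vm] = "skipped: resource check failed"
            continue
        try:
            backup_one(vm)
            results[vm] = "ok"
        except Exception as exc:                   # one failure != all fail
            results[vm] = f"failed: {exc}"
    return results

# tiny demo with a fake check and a backup that fails for one VM
def fake_backup(vm):
    if vm == "vm2":
        raise IOError("NVMe dropped")

out = backup_all(["vm1", "vm2", "vm3"],
                 checks=[lambda: True],
                 backup_one=fake_backup)
```

In a real version the `checks` list would wrap things like free memory, load average, and HDD power state, and `backup_one` would do the clone/zip/copy steps; the point of the shape is just that checks run per-VM and failures stay isolated.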
Not added in yet; this isn't an extensive list, but it's what needs to be improved:
Automatically go through each backup and check that it's usable at a basic level, otherwise marking it as corrupt
Better handling of failures at different levels, easier to check and fix so it doesn't go onto a backlog
Figure out a solution to the inherent problem that the nodes which statistically need backups most also have the most issues taking backups (complex network and disk issues can cause backup problems that don't get caught as well). This ends up making it seem like backups are all unreliable; nope, they're just mostly unreliable for nodes that have issues, and those are the ones that need them most, so that's not good.
Have a quick overview page for all of them (easy/possible if implemented with the idea we were originally discussing)
Now I'm not saying it's anywhere close to what we want right now, but at least for us we can say it's better than SolusVM.
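The "mark as corrupt" item could start out as simple as reading each archive end to end, so a truncated or damaged backup gets flagged at backup time instead of being discovered during a restore. A minimal sketch, using gzip as a stand-in for whatever format the real backups use:

```python
import gzip
import os
import tempfile
import zlib

def verify_backup(path):
    """Return 'ok' or 'corrupt'. Reads the whole compressed stream, so a
    truncated or damaged archive fails here rather than during a restore."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1024 * 1024):   # read in 1 MiB chunks to the end
                pass
        return "ok"
    except (OSError, EOFError, zlib.error):
        return "corrupt"

# demo: one good archive and a truncated copy of it
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "vm1.img.gz")
with gzip.open(good, "wb") as f:
    f.write(os.urandom(4096))            # incompressible payload

bad = os.path.join(tmp, "vm1-truncated.img.gz")
with open(good, "rb") as src, open(bad, "wb") as dst:
    dst.write(src.read()[:100])          # cut the stream short

good_status = verify_backup(good)
bad_status = verify_backup(bad)
```

A full-read check like this only proves the archive decompresses; a stronger version would also mount or inspect the filesystem inside, which is closer to the "usable at a basic level" bar described above.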
So, going back to why customers need to be involved, as it relates to taking good backups and restoring them: there's still the issue of what happens once the node is up. Now the customer has one backup version; they might want the fresh original version, or they might not. Too much back and forth is required to actually give the customer exactly what they want, confirm what's available and what isn't, and flag problems we locate with each specific backup. If we do bulk restores, something will be missed at some level. Remember how the system is granular? That means it's not a case of having backups for either everyone or no one; it gets segmented no matter what. Now add that to my original list of issues.
@FrankZ said: I see that you have already answered some of the issues I spoke of above while I have been writing this. Please understand that I really do not have any concept of what you deal with on a day-to-day basis, and these comments are not meant as criticism. You should do what you think is best for your company, because you are the one who will have to manage the outcome of those decisions. It is easy for me to play armchair quarterback while not having to deal with all the real day-to-day issues you face.
No, it's good to hear what a customer thinks. Whether we move forward with it or not, I already know I have to make some modifications to how it's all presented, at least, and it could even improve how we proceed without it in the future.
(edit) Yep, still shorter than the original long reply I didn't post.
@FrankZ said: but in NYC, Tokyo, or Los Angeles this might be a realistic option. It also helps that you already have storage servers in these locations, so spares could be used for both your and clients' backup servers.
We did this too. Actually, to be very specific, these are all the backups we have right now for NYC/LAX (with LAX meaning nearby locations for that purpose, and less uniform):
On the node itself (on its own schedule/thing)
On the backup server for NYC, LAX
External copy #1
External copy #2 (for NYC, done in a different way)
Again, it's just funny that we have 4 copies for NYC, which completed basically perfectly, and the only ones we're missing are the ones we ended up needing. I think one issue with DALZ004 was that we couldn't run backups and migrations simultaneously; then migrations ran into issues, backups ran into issues, and we ended up with one corrupt set of data scattered around and one set of incomplete backups that was incorrectly removed. So this one specifically is still under investigation, but very much "human error." It still doesn't make sense; the data should be somewhere, but we have to figure out who ran what, when, with which sub-system/script, and that could all be avoided if we complete this project. This node was also having connectivity issues, which complicated pretty much all of the above; without the connectivity issue it shouldn't have happened like that.
(edit) I will say this: if we had a dedicated "backup specialist" who was in charge of all that, did his job properly, and spent a couple of months validating everything and focusing on correcting it, that's how we could get to something like 99% reliability with backups. Or that could be our script that brings it all together, whether or not a frontend is added for customers.
@VirMach said: We're not that crazy, it's not RAID 0, just no RAID for most of them.
I stand corrected. I went back and saw where you said it was just going to be single drives.
@VirMach said: The reason is the same as one of the things you listed above which is customer confusion.....
After your explanation above, I can understand why you want the customer interaction; it seems like the lesser of two evils. I guess you could make a video to get around the reading-comprehension issue I spoke of, since people seem to stay focused on videos better than on the written word. If the procedure is not going to involve SolusVM, that removes my first issue as well.
@VirMach said: .... Now I'm not saying it's anywhere close to what we want right now, but at least for us we can say it's better than SolusVM.
Sure seems like it.
@VirMach said: So going back to why customers need to be involved more related to taking good backups and restoring it, there's still the issue of what happens once the node is up, now the customer has one backup version, they might want the fresh original version, ....
I did not take this into consideration, since I have my own redundant systems and consider my data to be my responsibility, not the provider's. After reading your comments I see that most of your customers do not see it that way, and you need to accommodate their expectations.
You seem to be well on your way to creating a better automated backup system, which I do understand is not an easy task at scale. Sounds like I was preaching to the choir based on what you said you have already accomplished. I well know that human error can happen, based on personal experience, so if DALZ004 was just an outlier, then it should not be a consideration in the larger issue.
Thanks for taking the time to explain all this. It was enlightening to have a little peek at what is going on behind the scenes at VirMach.
One thing I have gathered from this conversation is that I am a damn good person to have as a customer.
Seattle Update - Equipment being released, sending them boxes and packing material to arrive tomorrow, they may or may not already have some. Already sent them labels as well. Best case scenario looks like it'll arrive in San Jose on Saturday, in which case we've confirmed with someone that they can rack it that day. All up to Sabey at this point on the timeline.
Still working on setting up servers for loading in backups, most IPv4 announced, a couple waiting on IPXO.
Already got the uplink ready and getting the switch ready; the servers are ready too, but I'm doing finishing touches so it's smooth. Trying to find ethernet cables/power though... I hope that's not our downfall. I simultaneously have hundreds and can never locate them. Asking QN if they have any and confirming the PDU type; shouldn't be a problem, as I also have like 1000 feet of ethernet I can make myself.
(edit) Also hoping they have serial; I left both my dongles at SJC. I know they do, just not sure if they have one for us or for themselves, and whether they'd be able to help tonight otherwise. I do have others, but I'm mentioning all this so it's clear there's a possibility I could have it 99% ready for loading in starting tonight and we just miss the window and have to continue in the morning. Not doing an official status update until I confirm everything.
Oh I also have a little secret up my sleeve, I haven't drank coffee yet, got everything else done without it so far, now's the time so I make sure we can power through the SEA --> LAX backup loading into the morning after equipment gets set up tonight. Looking good, let's see if it lasts. Hey if I go downstairs and my car doesn't have the tires stolen off it we should be OK.
Super sad to see Seattle not work out. It was such a great location and I have a ton of nice VPSes there. LAX makes them redundant, but I understand what has to be done. I just had my fingers crossed that it would all get resolved.
Some complications. If that changes the timeline, I'll let you guys know. I'm having to make a lot of last-minute changes to get it working for tonight. Still looking good overall, though.
I'm still around, just taking a break and having some food. Servers were finally ready ready as of maybe an hour ago. I couldn't find the dongle I need, my laptop is malfunctioning in many ways, and no one's at the facility (I assume until at least the morning) so I'm trying to time it for that instead. I basically either need to borrow their laptop or have them configure it. A lot of other fun stuff happened obviously to make it a pain as usual, but we'll just say it went well.
@sh97 said:
Hey Virmach,
How did the SEA -> LAX migrations go?
Had to end up going to sleep after meetings with Flex and Sabey and preparing for the IPv4 changes. Someone stole our dolly, and someone had blocked in my vehicle last I checked; I was ready to leave at noon. No one is blocking the car now (it's a tandem spot, and the car that works is in the back spot), but I still have to figure out how to get everything downstairs without a dolly. I guess I'll just have to make several trips.
Atlanta Update: It seems like their sales guy lied ... shocking, I know.
Outside of ignoring us after getting the most important part of his job done: we initially went with Atlanta because we were told the PDUs were already set up and that he had a cabinet for us, ready to go. After speaking with the actual provisioning team about 5 days after that point (so basically the day the entire move was supposed to be completed), they confirmed that the power is in fact not even set up yet and that they had only finished the networking that day.
I did just get an update though right before sleeping:
I have good news. You’re power is slated to be installed tomorrow along with your Flexential Cabling and PDU’s.
Our contract with Flexential ends in a year. I'm also letting you guys know "unofficially" that we'll probably not be renewing it, so we'll see what happens within the next year, but this is your unofficial notice that your service in Atlanta will probably have some maintenance window and facility change in September 2024. I don't like the way they've treated us so far; everything screams that they cannot be trusted long-term.
The same can probably be said about Hivelocity, and that'd be around June/July 2024. Unless they magically decide to stop bothering us, but I don't believe that will be the case. Plus, we cannot trust them enough to send more equipment, since they've already threatened us in some way at least once, one instance being specifically a threat to hold equipment hostage during an outage they caused, instead of de-escalating.
@ConnersHua said:
TYOC038 can't connect to the panel today after a hard disk failure yesterday
My Server is offline...
We received an alert from our system already; the network status just hadn't been added yet. It's working as intended: it intentionally breaks the panel to prevent people from re-installing and making the situation worse. This will likely be worked on along with AMSD030 on Sunday; I'll add that to the update.
@cornercase said:
So I guess my SJCZ005 was moved to LAXA007 however currently shows up with no OS installed. Should I be expecting a backup to load in here?
I don't believe that should have happened. Make a ticket so I can check.
Comments
I'll provide a scenario for how the above potential feature is envisioned.
Let's say a node goes down. A pop-up appears saying "Hey, this might be facing issues. Want to deploy a backup service?"
You click yes, and it generates a new service with the same specifications in a nearby location (or the same location if still available). If it succeeds, it loads in and gets marked as your "secondary" service. Your service details page marks the "secondary" service as the "active" one and shows you a notice that your other service has been moved into a passive state and will be terminated automatically in 7 days. You can visit it in the second tab.
The "secondary" service, when marked active, will also have a "load in backups" option: it checks for a backup and loads it in. If that fails, it tells you to re-install an operating system instead. The "active" service being the secondary means you get full controls on it.
The decision is now yours on how you want to proceed. By default, the "secondary" service becomes the permanent new service; after 7 days we assume we're doing a terrible job at restoring the original, and it goes away from your view. Otherwise, if the original comes back within those 7 days, you can choose to go back to it: visit its tab, switch it to active, and confirm that the "secondary" will go into a passive state and be deleted when the timer runs out instead. You can technically switch back between the two as much as you want while both are available during the 7 days, but on each switch the service going into the inactive state powers down. We may add a button for a one-time temporary power-on of the inactive service when available, for something like 8 hours, to let you transfer files between them.
This process also helps us set a deadline for fixing nodes: if a node isn't fixed after 7 days, it'll still appear in your "history" later on as a feature on the second tab, where you can contact support about it. (We let you know beforehand in a pop-up if nothing can or will be done about it; in that case it'll just get marked as rejected, with no human response unless we can actually do something.)
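The lifecycle described above is essentially a small state machine: two services, exactly one active at a time, a 7-day timer, power-down on each switch, and termination of whichever side is passive when the timer expires. A rough sketch of that logic, purely illustrative (all class and method names are made up, not any real panel API):

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(days=7)   # passive side is terminated after this

class Service:
    """Minimal stand-in for a deployed VM."""
    def __init__(self, name):
        self.name = name
        self.powered = True

    def power_on(self):
        self.powered = True

    def power_off(self):
        self.powered = False

    def terminate(self):
        self.powered = False

class FailoverPair:
    """Tracks an original service and its redeployed "secondary" twin.

    Exactly one side is active at a time; the passive side is powered
    down and scheduled for termination when the 7-day timer expires.
    """
    def __init__(self, original, secondary, now=None):
        now = now or datetime.utcnow()
        self.services = {"original": original, "secondary": secondary}
        self.active = "secondary"            # the secondary starts out active
        self.terminate_at = now + GRACE_PERIOD

    @property
    def passive(self):
        return "original" if self.active == "secondary" else "secondary"

    def switch_active(self, target):
        """Customer-initiated switch, allowed any number of times while
        both sides are still available within the grace period."""
        if target not in self.services:
            raise ValueError(f"unknown service: {target}")
        if target == self.active:
            return
        self.services[self.active].power_off()   # outgoing side powers down
        self.active = target
        self.services[target].power_on()

    def expire(self, now=None):
        """Once the timer runs out, terminate whichever side is passive;
        the active side becomes the permanent service."""
        now = now or datetime.utcnow()
        if now >= self.terminate_at:
            self.services.pop(self.passive).terminate()
            return True
        return False
```

Note the timer is fixed at creation and does not reset on switches, matching the "deleted when the timer runs out" wording above.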
"Oh, he signed the agreement? He's a Flex customer now? Turn off his power, stop replying to him."
My personal opinion is that you are adding layers of complexity onto a system, SolusVM/WHMCS, that I expect will only cause additional issues and even more support tickets. If you are trying to eliminate extended downtimes for customers, it might be better to be proactive as opposed to reactive; working on the root causes of these issues may be a better way to go about it. The frequency of recurring extended downtime issues at VirMach is not a normal thing in the industry.
When a node has an unsolvable hardware issue like LAX2Z019 or DFWZ007 just telling people that their VM is going to be recreated somewhere else is the right answer IMO. I don't like being told that I am going to have to spend a few hours rebuilding something, but it beats waiting 1-6 weeks and then being told I need to rebuild it anyway.
IMO simplifying your operating methods, and working with reliable, responsive partners is the way to solve these issues.
EDIT: Of course since I am looking at things from just a customer's perspective I may not understand all the issues involved.
For staff assistance or support issues please use the helpdesk ticket system at https://support.lowendspirit.com/index.php?a=add
Shorter version of my reply: if the issue is that we have a bad time during emergencies trying to load in new services, do you think it could be worse with the self-serve optional system? Or, more specifically, what do you envision going wrong where it's worse than the current alternative of just being down? I have some answers myself as I've thought it out; I'm wondering what yours are.
As for unsolvable issues, where the right answer is to recreate somewhere else, I'll present the issue that comes with this: customers who value their data more than the service being up, and who just get angry if a new service goes up with no data. There's a good number of these, where they misunderstand the situation; on top of that, moving more people is more work, especially while also dealing with all the other work involved.
I feel like we've nailed this part, even if it doesn't seem like it due to everything else going on. The latest fix looks great on every node except the ones with a specific board, of which we have maybe 3 or 4 right now, and most of those have already been retired. So we've already handled most if not all of the "proactive" side, and this would be the "reactive" portion. The balance isn't 100%-to-0% proactive-to-reactive; there still need to be some changes made to the reactive side.
But yes generally agreed, I have the same philosophy.
We definitely wouldn't do this first and then try to fix the problem. We only do that in emergency scenarios. For example, when we did the VLAN splitting, that wasn't fixing the actual issue, it was just a Band-Aid; we do those too when necessary, but this isn't one of them.
Sabey/Seattle update I just got: as usual their guy is really nice and just trying to help out, and the delays were basically on Unitas's side, plus him and myself being busy and trying to coordinate everything on each end. Anyway, it looks like they can just ship out our stuff. It's not 100% yet, but here's the game plan since we're adding that into the mix:
Whichever gets people up faster, we go with that. We'll send people an email in any case just to confirm what happened and what else they can do. If they end up going up in San Jose, good, they stay there for now, and we'd know which nodes to offer migration back to Seattle later, optionally, if we keep Seattle. If they end up going to LAX from backups, we make a list of anyone who wants to be loaded in from San Jose to LAX, or LAX to San Jose, or whatever, in a neat way.
As for Seattle going back up in Seattle: basically a no-go. It's probably going to take 2 weeks after all that to get transit. It could be expedited for a high fee, but that's not guaranteed, and it'd be difficult to do the right blend/IX right now in any case. It seems like we need to just do a relaunch for that location. Even expedited would still likely run into mid-week next week, and that's if everything goes perfectly.
I think, going back to trying to make improvements/adding more clunky features: we usually focus on how it goes wrong and then take what went right as the "default."
If we went back to stock WHMCS/SolusVM (I'm just trying to envision it right now), even with all the extra issues our setup may have caused at one time, I think the general vibe is that it was overall positive and saved us time. I actually cannot imagine being able to run things as efficiently as we do (whether that's 5% efficient, 50%, or whatever) without all our weird scripts.
My concerns here would be:
1. SolusVM or WHMCS makes a change that breaks this option after you implement it.
2. That customers will misinterpret what is going to happen when they activate this option, because many people seem to have a problem with reading comprehension and it is not something they are familiar with from other providers.
3. That both 1 & 2 will create tickets that require longer back and forth responses than when it is just down.
I can understand clients who are paying you big bucks having these expectations, but not the lower-end market. My understanding is that you are using RAID 0 for various reasons relative to the drives and the Ryzen boards you had available at the time of build. In this system, like most hardware systems, it is not a question of if it will fail, but when. If you want to be able to restore customer data from backups, make a good backup system, and when a node has been determined to have a catastrophic failure, restore backups to a new node. I don't understand why the customer needs to be involved in this choice. I understand that this can't be done in every location, because it would be difficult to justify a non-utilized spare node in places like Denver, but in NYC, Tokyo, or Los Angeles this might be a realistic option. It also helps that you already have storage servers in these locations, so spares could be used for both your and clients' backup servers.
I also understand that places like Tokyo or Frankfurt, where we used to hear about issues all the time, have been silent for quite a while now. I expect that is because you have resolved the hardware/firmware/kernel issues that were the cause, and xTom is a good partner. I liked your OKC idea because I thought it might eliminate what I think has been your biggest issue: reliable partner DCs. There is not much you can do about things on the other side of the country if your DC partner will not pick up the phone or answer your ticket. So IMO this is a big part of proactively solving the longer-term issues.
I saw it as a positive that you are moving Seattle to Los Angeles. I have a couple of VMs there and can understand why others may not feel so positive about the move, but if you see that Sabey can't work in a reasonable amount of time, it is better to get out than continue adding complications.
I see that you have already answered some of the issues I spoke of above while I have been writing this. Please understand that I really do not have any concept of what you deal with on a day to day basis and these comments are not meant as a criticism. You should do what you think is best for your company, because you are the one who will have to manage the outcome of those decisions. It is easy for me to play arm chair quarterback while not having to deal with all the real day to day issues you face.
I like what you have done with the WHMCS changes so far, and if it makes you more efficient all the better.
We're not that crazy; it's not RAID 0, just no RAID for most of them. The ones where we did end up doing "hardware" RAID10 are the ones that ended up having the biggest problems, as I suspected could happen, because now it's another set of drivers that works less well, especially when it comes to any updates. It also doesn't seem to work properly when a drive drops off: if an NVMe drops off, whether in "hardware" RAID10 or no RAID, it's basically the same situation, but with even more headache in RAID. So there's basically no point in having RAID10 other than maybe allowing for recovery, and even that's a finicky process.
SolusVM released a feature that we could use to handle all this a little better, but that would require more customization using the new API, which I decided against, ironically for the reasons you described. I'll think about that further for this other change, but I guess when it comes to something like disk, I automatically agree with you; for this other change it's hard to gauge whether it could end up being worse.
I guess one way we could do it is roll it out for a limited number of servers in beta for a few months, or rather just try it out as one downtime occurs in the future and compare.
I answered this in my long version which I didn't post, so I'll copy over the relevant portion.
Actually nevermind, it's so long and it's faster to just write it out again versus reading through to find it.
The reason is the same as one of the things you listed above, which is customer confusion. What we noticed when we did the recreations on a new node was that people would get confused, over various periods of time and in different ways. They'd not read anything, misunderstand, and end up getting angrier than if we had just left it offline. If it's offline, or they requested something themselves, they're aware of what's going on; otherwise they assume offline means something's wrong.

If it's online but now it doesn't have an operating system, it's in a different location, on a different server, with a different IPv4 address, those can all be seen as individual issues. For others it gets worse: they think their data is now 100% lost, so either they don't contact us to proceed with any backups being loaded in, or they go off on us as if we had already definitely lost data, and it turns into those types of tickets that take a very long time to defuse. Once they're already mad, it doesn't matter if we let them know what's actually happening; they're mad, and they get mad at every little part of the process and how it's done.

And if we have to do all that work manually anyway, for everyone, even those who do not want it, then by default it's more work, plus possible human error. If we automate it, it's going to do the same thing and run into the same problems, just in bulk and without any good refinements over time. One good thing about self-serve is that it basically addresses all this, and if it breaks, the customer will also see it broke, and then it will definitely (okay, probably not definitely, but as close as it gets) correctly be attributed by the customer as "I clicked this feature, and it broke," versus "why did you lose my data / why is my IP different / why does it say no disk / where did my service go," plus all the tickets about whatever else related to us just automatically redeploying everyone, even with communication.
Well this would be that. If your issue with it is it being presented to the customer as anything, and think we should just keep it on the backend then that could probably be done. I guess in a broad sense I see it this way, if we're trying to have a self-driving system for VirMach private roads, we might as well just apply that same idea for customer usage and then that helps refine it more quickly.
I guess I didn't really dive into a lot of the backend stuff, but it incorporates the actual backing-up part: organizing it so it can be done securely and neatly by the system. We already basically have our own scripts that are better than SolusVM's way of doing everything (they did finally "fix" some major issues with backups in the latest release, but who knows whether that means they didn't create a second problem.)
So let me do a quick comparison of where we're at. This is compared to how SolusVM did it when we last checked, not the most recent version. I didn't want to go into this part because it would end up being long, but here we go; now I have a good overview of this system that I can save for documentation.
SolusVM:
Our way so far (mainly improved/reworked/added features during the Dedipath moves, but very segmented):
Not added in yet (this isn't an extensive list, but it's what needs to be improved):
Now I'm not saying it's anywhere close to what we want right now, but at least for us we can say it's better than SolusVM.
So, going back to why customers need to be involved (related to taking good backups and restoring them): there's still the issue of what happens once the node is up. Now the customer has one backup version; they might want the fresh original version, they might not. Too much back and forth is required to actually give the customer exactly what they want, confirm with them what's available and what isn't, and flag problems we locate with each specific backup. If we do bulk restores, something will be missed to some degree. Remember how the system is granular? That means it's not a case of either having backups for everyone or for no one; it gets segmented no matter what. Now add that to my original list of issues.
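To illustrate the granularity point above: because backup coverage is tracked per service rather than per node, any bulk restore of a failed node splits into partial groups. A hypothetical sketch (all names and the manifest shape are made up):

```python
# Hypothetical sketch: because backups are granular (per service, not
# per node), a failed node never splits cleanly into "everyone has a
# backup" or "no one does"; the restore plan is segmented either way.

def plan_restores(vms_on_node, backup_index):
    """Split a failed node's services into restorable and not-restorable.

    backup_index maps a service id to its available snapshots, newest
    first; a missing or empty entry means no usable backup exists.
    """
    restorable, needs_reinstall = {}, []
    for vm in vms_on_node:
        snapshots = backup_index.get(vm, [])
        if snapshots:
            restorable[vm] = snapshots[0]     # restore the newest snapshot
        else:
            needs_reinstall.append(vm)        # offer a fresh OS install
    return restorable, needs_reinstall
```

Even this toy version shows why a bulk restore always produces two customer groups that need different messaging.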
No, it's good to hear what a customer thinks. Whether we move forward with it or not, I already know I have to make some modifications to how it's all presented, at least, and it could even improve how we proceed without it in the future.
(edit) Yep, still shorter than the original long reply I didn't post.
We did this too. So actually, to be very specific, these are for example all the backups we have right now for NYC/LAX, with LAX covering nearby locations and being less uniform.
Again, it's just funny: we have 4 copies for NYC, it completed basically perfectly, and the only ones we're missing are the ones we ended up needing. I think one issue with DALZ004 was that we couldn't run backups and migrations simultaneously; then migrations ran into issues, backups ran into issues, and we ended up with one corrupt set of data scattered around and one set of incomplete backups that was incorrectly removed. So this one specifically is still under investigation, but very much "human error." It still doesn't make sense; the data should be somewhere, but we have to figure out who ran what, when, with which sub-system/script, and that could all be avoided if we complete this project. This node was also having connectivity issues, which complicated pretty much all of the above; without those, it shouldn't have happened like that.
(edit) I will say this: if we had a dedicated "backup specialist" who was in charge of all that, did his job properly, and spent a couple of months validating everything and focusing on correcting it, that's how we could get to something like 99% reliability with backups. Or that could be our script that brings it all together, whether or not a frontend is added for customers.
I stand corrected. I went back and saw where you said it was just going to be single drives.
I can understand, after your explanation above, why you want to have the customer involved. It seems like the lesser of two evils. I guess you could make a video to get around the reading comprehension issue I spoke of; people seem to stay focused on videos better than the written word. If the procedure is not going to involve SolusVM, that removes my first issue as well.
Sure seems like it.
I did not take this into consideration, as I have my own redundant systems and consider my data to be my responsibility and not the provider's. After reading your comments I see that most of your customers do not see it that way, and you need to accommodate their expectations.
You seem to be well on your way to creating a better automated backup system, which I do understand is not an easy task at scale. Sounds like I was preaching to the choir based on what you said you have already accomplished. I well know that human error can happen, based on personal experience, so if DALZ004 was just an outlier, then it should not be a consideration in the larger issue.
Thanks for taking the time to explain all this. It was enlightening to have a little peek at what is going on behind the scenes at VirMach.
One thing I have gathered from this conversation is that I am a damn good person to have as a customer.
Seattle Update - Equipment being released, sending them boxes and packing material to arrive tomorrow, they may or may not already have some. Already sent them labels as well. Best case scenario looks like it'll arrive in San Jose on Saturday, in which case we've confirmed with someone that they can rack it that day. All up to Sabey at this point on the timeline.
Still working on setting up servers for loading in backups, most IPv4 announced, a couple waiting on IPXO.
Already got the uplink ready, getting the switch ready; the servers are ready too, but I'm doing finishing touches so it's smooth. Trying to find ethernet cables/power though... I hope that's not our downfall. I simultaneously have hundreds and can never locate them. Asking QN if they have any and confirming PDU type; shouldn't be a problem, I also have like 1000 feet of ethernet I can make myself.
(edit) Also hope they have serial; I left both my dongles at SJC. I mean, I know they do, I'm just not sure if they have one for us or for them, and whether they'd be able to help tonight otherwise. I do have others, but I'm mentioning all this so it's clear there's a possibility I could have it 99% ready for loading in starting tonight and we just miss it and have to continue in the morning. I'm not doing an official status update until I confirm everything.
Oh, I also have a little secret up my sleeve: I haven't drunk coffee yet. I got everything else done without it so far, and now's the time, so I can make sure we power through the SEA --> LAX backup loading into the morning after the equipment gets set up tonight. Looking good, let's see if it lasts. Hey, if I go downstairs and my car doesn't have its tires stolen off it, we should be OK.
san joseeeeeeeeee
I bench YABS 24/7/365 unless it's a leap year.
Super sad to see Seattle not work out. It was such a great location and I have a ton of nice VPSes there. LAX makes them redundant, but I understand what has to be done. I just had my fingers crossed that it would all get resolved.
Fucking Dedipath.
Some complications. If that changes the timeline, I'll let you guys know. I'm having to do a lot of last-minute changes to make it work for tonight. Still looking good overall, though.
Still working on it, at least a couple hours past when I originally wanted to head there but will be solved.
I'm still around, just taking a break and having some food. Servers were finally ready as of maybe an hour ago. I couldn't find the dongle I need, my laptop is malfunctioning in many ways, and no one's at the facility (I assume until at least the morning), so I'm trying to time it for that instead. I basically either need to borrow their laptop or have them configure it. A lot of other fun stuff happened, obviously, to make it a pain as usual, but we'll just say it went well.
3/3 of my vpses are up. problem?
LAX2Z019 has been having problems for more than a month, who knows when it can be used again
thats vbad!
From the status page regarding LAX2Z019.
Shouldn't live migration start to work automatically then?
At least some providers here offer manual migration when downtime is more than a few days.
MicroLXC is lovable. Uptime of C1V
Hey Virmach,
How did the SEA -> LAX migrations go?
So resuming figuring that out now.
Man, this is a blog that just keeps on giving.
Poor, poor Vir. (said with a N. Irish accent.)
It wisnae me! A big boy done it and ran away.
NVMe2G for life! until death (the end is nigh)
I will just ask... is Flexential really that much cheaper than xTom?
// Oh, xTom has only one location in the USA, wtf
Haven't bought a single service in VirMach Great Ryzen 2022 - 2023 Flash Sale.
https://lowendspirit.com/uploads/editor/gi/ippw0lcmqowk.png