Being observant, noticing things and events that others miss, is always an advantage, even if you are not a detective.
Many years ago, I had a housekeeper who had no luck with vacuum cleaners - about once a month she would leave a note: “the vacuum cleaner got broken, please buy a new one”. Shopping is not my thing, so I passed these notes to a colleague whose tasks included buying supplies for our office, or the housekeeper contacted him directly when I was out of the country. Years passed, with more and more vacuum cleaners failing. Meanwhile, the housekeeper got married, and one day I came home during her working hours to pick something up and noticed that she was pregnant.
Cleaning is hard work, and it is definitely not for a pregnant woman. However, you cannot fire someone just for being pregnant, and maternity leave usually starts just days before the due date. When I asked, she said that she was due in five months. I just handed her five months' salary and told her she did not have to clean my home until the start of her maternity leave. The next day I had to clean by myself, while waiting for new housekeeper candidates to appear.
My home had a room where the housekeeper kept her tools. I went into that room for the first time in four years, and I found a large heap of vacuum cleaners, literally dozens of them! She had kept their boxes, and the room looked like a vacuum cleaner store, with various brands and models. My colleague had probably been experimenting with brands and models to stop the phenomenon, obviously without success. But how was that possible? I opened the first box and disassembled the device. What I found was that the bag was full of dust, so the sensor was preventing the electric motor from starting. After I changed the dust bag, the vacuum cleaner started normally. Curious about the other boxes, I unpacked a few more, only to find the same cause of “failure”. Obviously the housekeeper had no idea that a $100 vacuum cleaner does not annihilate dust, it just stores it. The next day I had a vacuum cleaner with a water filter instead of a dust bag, and I made sure the next housekeeper knew how to work with it and how to empty the dirty water after work.
How could I know that someone was not aware of such a simple matter? But this case taught me a valuable lesson - always expect anything, and never assume that others think and see the same way you do!
In the case of the vacuum cleaners, the oversight was mine: I failed to notice what was going on for years, simply because I was not paying attention to such matters. However, when it comes to my business, my supervision is quite strict, and I dig into each case thoroughly, especially if something is failing (a server or a component, for example).
My colleagues at ICDSoft are obliged to inform me about each problem, even the least important ones. At the very beginning of my web hosting business, I was able to notice trends my colleagues did not notice, as they were working in shifts. For example, a server might have problems (high CPU load) several times within three consecutive days. Later, we set up a very strict and neat system of maintenance logs, where a detailed report of each issue is stored. Now, that system issues a warning if it detects a sequence of events that calls for more attention and specific measures.
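To make the idea concrete, here is a minimal sketch of such a check in Python. It is not ICDSoft's actual system: the log format (date, server, issue), the function name, and the thresholds are all assumptions chosen for illustration. It flags any server that logged the same issue several times within a window of consecutive days.

```python
from collections import defaultdict
from datetime import date, timedelta


def servers_needing_attention(log, issue, window_days=3, min_events=3):
    """Return servers that logged `issue` at least `min_events` times
    within any window of `window_days` consecutive calendar days.

    `log` is an iterable of (date, server, issue) tuples - a hypothetical
    format, not the real maintenance log schema.
    """
    dates_by_server = defaultdict(list)
    for when, server, what in log:
        if what == issue:
            dates_by_server[server].append(when)

    flagged = []
    for server, dates in dates_by_server.items():
        dates.sort()
        for i, start in enumerate(dates):
            window_end = start + timedelta(days=window_days)
            # Count events from this one up to the end of the window.
            in_window = sum(1 for d in dates[i:] if d < window_end)
            if in_window >= min_events:
                flagged.append(server)
                break
    return sorted(flagged)


# Hypothetical sample entries for illustration only.
sample_log = [
    (date(2024, 1, 1), "server-a", "high CPU load"),
    (date(2024, 1, 2), "server-a", "high CPU load"),
    (date(2024, 1, 3), "server-a", "high CPU load"),
    (date(2024, 1, 1), "server-b", "high CPU load"),
    (date(2024, 1, 3), "server-b", "high CPU load"),
    (date(2024, 1, 10), "server-b", "high CPU load"),
    (date(2024, 1, 2), "server-c", "disk failure"),
]
```

With the sample entries above, only `server-a` reaches three events inside a three-day window. Operators on different shifts would each see a single incident, while a check over the whole log sees the pattern.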
At the moment, there are 1360 solid state drives and 1169 magnetic hard drives running on our servers (SSDs on production servers, HDDs on backup servers). With that many drives, about five to eight disks fail each month. All disks are in RAID arrays, so the failure of a single drive does not cause any data loss. Whenever it happens, we just send a request to the datacenter technicians: "We need Disk9 on Server351 moved to the 'Failed disks' box and replaced with a disk from the 'SSD4' box."
Every few months, we purchase a batch of 100-200 SSDs. We keep a record for each drive - model, size, batch - and from time to time our system warns us that a certain batch is experiencing an unusually high failure rate. According to my statistics, this can happen with any brand. You will see me mentioning luck quite often - yes, luck plays an important role in our lives, and no matter how hard we fight, luck will always be a factor. So whenever we find an "unlucky" batch of problematic SSDs, we replace them all with drives from another batch and scrap them. We have even had bad luck with a full batch of servers. We keep new servers running heavy loop cycles for weeks before they actually deserve the honor of being production servers at ICDSoft. During the testing period, we noticed a mass problem with that batch, and to be on the safe side, we just disassembled the machines.
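A batch warning of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record format (batch label plus a failed flag), the function name, and the thresholds `factor` and `min_failures` are hypothetical, and the real system may use a different rule. The sketch flags a batch when its failure rate is well above the rate for the whole fleet.

```python
from collections import defaultdict


def unlucky_batches(drives, factor=2.0, min_failures=3):
    """Return batches whose failure rate exceeds `factor` times the
    fleet-wide failure rate.

    `drives` is an iterable of (batch, has_failed) pairs - an assumed,
    simplified version of a per-drive record. `min_failures` avoids
    flagging a batch on one or two isolated failures.
    """
    total = defaultdict(int)
    failed = defaultdict(int)
    for batch, has_failed in drives:
        total[batch] += 1
        failed[batch] += int(has_failed)

    if not total:
        return []

    fleet_rate = sum(failed.values()) / sum(total.values())
    return sorted(
        batch
        for batch in total
        if failed[batch] >= min_failures
        and failed[batch] / total[batch] > factor * fleet_rate
    )


# Hypothetical inventory: three batches of 100 drives each.
sample_drives = (
    [("A", i < 1) for i in range(100)]     # 1 failure
    + [("B", i < 2) for i in range(100)]   # 2 failures
    + [("C", i < 20) for i in range(100)]  # 20 failures
)
```

In the sample inventory, batch `C` has a 20% failure rate against a fleet-wide rate of about 7.7%, so it is the only one flagged. Note that the fleet rate here includes the bad batch itself, which makes the check slightly conservative.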
Next time I will show you how we scrap old and failed drives. Formatting them is simply not enough, so we transport them to our mountain retreat in the Bulgarian countryside. Then we put them in a place where our customers' data will be completely safe - we "embed" the drives in concrete foundations whenever we build new underground facilities.
When a customer submits a support ticket, our SureSupport ticketing system adds a record to the tickets table. That record consists of the date, time, status, domain name, subject, the number of the server where the account resides, and the time since the ticket was submitted (in seconds). Our operators see the latest 30 tickets on the page. If there is more than one ticket for a certain server, these tickets are highlighted for the same purpose - such a match may be a clue to a problem that could be affecting all users on the machine.
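The highlighting rule is simple enough to sketch in Python. The field names `server` and `age_seconds` below are assumptions standing in for the server number and the time since submission; the actual SureSupport schema is not shown here.

```python
from collections import Counter


def servers_to_highlight(tickets, page_size=30):
    """Return the set of servers that appear more than once among the
    latest `page_size` tickets.

    Each ticket is a dict with hypothetical keys "server" (server number)
    and "age_seconds" (time since the ticket was submitted).
    """
    # The latest tickets are the ones with the smallest age.
    latest = sorted(tickets, key=lambda t: t["age_seconds"])[:page_size]
    counts = Counter(t["server"] for t in latest)
    return {server for server, n in counts.items() if n > 1}


# Hypothetical tickets for illustration only.
sample_tickets = [
    {"server": 351, "age_seconds": 40, "subject": "Site is slow"},
    {"server": 12, "age_seconds": 95, "subject": "Email question"},
    {"server": 351, "age_seconds": 130, "subject": "Timeouts"},
]
```

For the sample tickets, server 351 would be highlighted because two of the visible tickets point to it. A page could then color every ticket whose server is in the returned set.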
Being observant is definitely an advantage.
P.S. Does anyone want an almost new vacuum cleaner at a good price? Buy five, get ten! ;)