|
Post by Admin on Dec 13, 2016 23:11:20 GMT 1
Since most of this section's topics concern hardware maintenance and downtime, this one thread shall be the reference point for all such future communications.
This way, hardware events won't pollute the more useful discussions that might happen on this forum section.
|
|
|
Post by Admin on Dec 13, 2016 23:35:51 GMT 1
Disk failure.
A hard drive in the server has been intermittently unresponsive since yesterday. This causes intermittent I/O overload, which prevents ATR from accessing the database and responding to users' queries.
A replacement disk has been ordered, since I had no spares. It will be mounted within a week. No data has been lost, thanks to RAID1 replication and two-step backups.
The culprit is a Seagate Barracuda LP 2TB with 42,000 hours of uptime. We may agree that the drive is old enough, though I have had Hitachi disks that lasted twice as long.
|
|
|
Post by Admin on Mar 7, 2017 18:08:19 GMT 1
Down for DNS outage.
My DNS provider, Namecheap Inc, is having its regular "disservice as a service" outage phase. The ATR server won't be reachable until they fix my DNS.
|
|
|
Post by Admin on Mar 12, 2017 15:25:51 GMT 1
Down for halted routing
Following an upgrade of both the Linux kernel and LXC, the network routing rules stopped working without any warning and without my noticing. I realized this yesterday and finished fixing the routing this afternoon.
All is functional at the moment.
|
|
|
Post by Admin on Nov 22, 2017 3:24:41 GMT 1
Discovered DNS outage
One hour ago I accidentally discovered that the DNS serving this add-on has been nonfunctional for over two days and counting. This has left the backend server used by ATR unreachable by any client request during this time, including right now, since the issue is not yet solved, nor can it be solved from my side of the cable.
I opened a support ticket with the DNS provider, namecheap.com, and they are now supposed to resolve this issue ASAP.
As soon as the DNS is functional again, ATR will resume working with no further need for human intervention.
|
|
|
Post by Admin on Dec 7, 2017 23:27:02 GMT 1
Sorry for the additional hours of downtime today. I had to replace the fan of one CPU, which was causing overheating. After the fan was replaced, I got inexplicable software segfaults and memory corruption. It turned out there was a dust particle stuck between one of the RAM module pins and its motherboard socket. Apparently, simply opening or handling the case was to blame.
This is the second time that debris has caused me a malfunction; the first time was due to a tin particle from soldering wire causing a short between two leads of a PWM IC on the same motherboard. The tin piece was so small and light that it likely entered the case by being sucked through the air filter; the computer in fact sits on the floor of my workshop, and the dust can be too much at times.
|
|