us1 stream is down again...

...this is becoming far too regular an occurrence. Is Linode still experiencing DDoS attacks? I don't want to have to move my stream yet again, but I'm thinking I may have to. We can't have this much downtime.
 
This is what comes up after trying to start streaming:
Processing account a1061 ...Unable to access account for a1061: Cluster host connection failure for us1: Connection timed out (110)
Processing account a937 ...Unable to access account for a937: Cluster host connection failure for us1: Connection timed out (110)
Processing account a989 ...Unable to access account for a989: Cluster host connection failure for us1: Connection timed out (110)
 
Apologies for the issue. This does not appear to be a DDoS attack. The server is up but networking is down for some reason. As advised by the data centre, we have issued a reboot and this has brought the machine back online. We're investigating the cause at the moment.

Normally we would have caught this problem earlier, but as the server went offline in the middle of the night we were only notified silently rather than by an audible alarm. We have just now modified this behaviour so an actual alarm is sounded regardless of the time of day. Apologies again for the downtime.
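
For anyone curious what an alert that ignores the time of day could look like, here is a minimal sketch. It is not our actual monitoring setup: the `probe` and `should_sound_alarm` names are hypothetical, and it simply assumes a failed TCP connection (such as the "Connection timed out (110)" above) should always trigger the audible alarm.

```python
import socket
from dataclasses import dataclass


@dataclass
class ProbeResult:
    host: str
    port: int
    reachable: bool
    detail: str


def probe(host: str, port: int, timeout: float = 5.0) -> ProbeResult:
    """Try to open a TCP connection and report whether it succeeded."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return ProbeResult(host, port, True, "connected")
    except OSError as exc:
        # Timeouts, refused connections and DNS failures are all OSError subclasses.
        return ProbeResult(host, port, False, str(exc) or type(exc).__name__)


def should_sound_alarm(results: list[ProbeResult], hour_of_day: int) -> bool:
    """Return True if any probe failed.

    hour_of_day is accepted but deliberately ignored (hypothetical policy):
    a failure at 03:00 is treated exactly like one at 15:00.
    """
    return any(not r.reachable for r in results)
```

A scheduler would call `probe` for each cluster host every minute or so and pass the results to `should_sound_alarm`; the host names and ports to check are left to whoever runs it.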
 
We've had a look through the log files and it appears the server crashed just after 22:08:24 GMT yesterday (7th June). The reason for the crash is unclear and we haven't seen this happen before. We are monitoring for this sort of unusual behaviour as of now so we will be alerted sooner if this happens again. Apologies once again.

All us1 servers have been brought back online and should be up and running as of about 20 minutes ago.
 
We can confirm that we experienced 10 hours of downtime this time, from June 6 at 22:10 to around 8am on June 7.
We need to remind you that we are a radio station broadcasting 24/7. Ten hours is a long time to be on air with nothing playing.
We also need to explain what happened on our official website, Facebook, Google page etc. and apologize to listeners.
 
I must agree with Megaton Cafe Radio here. I appreciate your apology, but we too are a 24/7 streaming network with obligations to our hosts, sponsors, and listeners, and "I'm sorry but we didn't have an alarm to let us know" is hardly a makegood for a situation that is particularly problematic for us right now. This week we have an obligation to a partner (Brooklyn Pride Festival) who are in the middle of their event and are expecting coverage and promotional consideration from Radio Free Brooklyn, not dead air. You made us look terrible. Somehow, "oops, sorry" isn't quite good enough.
 
We totally agree that 10 hours is a long time and not normal. All we can do is apologise profusely and assure you that changes have been made so this situation doesn't happen like this again. We've never had a server crash this way before, so we've updated our procedure to cater for it just in case it happens again. We're also looking into the cause of the crash. It seems to be related to Centova Cast's mp3gain (stream normalisation) software.
 
Sorry about that. I had to restart the machine as it crashed again. I caught it this time, so it only meant a reboot and about 10 minutes of downtime. I also managed to catch a glimpse of the console as it happened:

[<ffffffff81003929>] ? do_fast_syscall_32+0xc3/0x132
[<ffffffff819c85f2>] ? entry_SYSENTER_compat+0x52/0x70
Task dump for CPU 4:
icecast R running task 0 22239 1 0x00000008
0000000000000246 ffff8800ba446608 ffffffff818e06ae ffff8800aa705564
0201ffff4c3b7705 ffffffff820ed040 0000000000000003 ffff88033fd19300
ffff880333001500 0000000002090220 0000000000000246 ffff8800ba447170
Call Trace:
[<ffffffff818e06ae>] ? ipt_do_table+0x586/0x5ae
[<ffffffff81110b4c>] ? add_wait_queue+0x17/0x41
[<ffffffff817b5b33>] ? compat_sock_ioctl+0xa45/0xa45
[<ffffffff81110aba>] ? remove_wait_queue+0x13/0x4d
[<ffffffff811e4fc5>] ? poll_freewait+0x3b/0x88
[<ffffffff811e5f5b>] ? do_sys_poll+0x38b/0x40c
[<ffffffff8189249e>] ? ip_finish_output2+0x245/0x291
[<ffffffff810465b3>] ? kvm_wait+0x36/0x4d
[<ffffffff811125f8>] ? __pv_queued_spin_lock_slowpath+0x108/0x240
[<ffffffff819c5f06>] ? _raw_spin_lock_irqsave+0x2d/0x34
[<ffffffff811c66b1>] ? __slab_free+0xab/0x264
[<ffffffff81101dd8>] ? get_nohz_timer_target+0x1e/0x8f
[<ffffffff815bb5b5>] ? copy_page_to_iter_iovec+0xdf/0x29a
[<ffffffff811c77e1>] ? kmem_cache_free+0x163/0x1bf
[<ffffffff8189b030>] ? tcp_recvmsg+0x649/0x887
[<ffffffff818bda88>] ? inet_recvmsg+0x74/0x85
[<ffffffff817b6ee7>] ? SyS_recvfrom+0xb3/0xfc
[<ffffffff8112af77>] ? ktime_get_ts64+0x50/0xc7
[<ffffffff811e513b>] ? poll_select_set_timeout+0x53/0x74
[<ffffffff811e6076>] ? SyS_poll+0x4d/0xb6
[<ffffffff819c622e>] ? entry_SYSCALL_64_fastpath+0x12/0x71

Contrary to my previous belief, it's actually an Icecast 2 server and not an ices encoder that's crashing, which is really odd. This should help me in diagnosing this random crash issue.
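
If the console output from each crash is saved, the task named in every "Task dump for CPU N:" section can be pulled out and compared across crashes. The sketch below is only an illustration: `running_tasks` is a hypothetical helper, and it assumes the columns after "running task" are free stack, PID, parent PID and flags, as they appear in the dump above.

```python
import re

CPU_HEADER = re.compile(r"^Task dump for CPU (\d+):$")
# Assumed column order: comm, state, "running task", free stack, pid, ppid, flags.
TASK_LINE = re.compile(
    r"^(?P<comm>\S+)\s+(?P<state>[A-Z])\s+running task\s+"
    r"(?P<free>\d+)\s+(?P<pid>\d+)\s+(?P<ppid>\d+)\s+(?P<flags>0x[0-9a-fA-F]+)$"
)


def running_tasks(console_text: str) -> list[dict]:
    """Return one entry per 'Task dump for CPU N:' header followed by a task line."""
    tasks = []
    cpu = None
    for raw in console_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = CPU_HEADER.match(line)
        if header:
            cpu = int(header.group(1))
            continue
        if cpu is not None:
            task = TASK_LINE.match(line)
            if task:
                tasks.append({
                    "cpu": cpu,
                    "comm": task.group("comm"),
                    "pid": int(task.group("pid")),
                    "ppid": int(task.group("ppid")),
                })
            # Only the first non-empty line after a header is considered.
            cpu = None
    return tasks
```

Run over the trace above, this reports `icecast` with PID 22239 on CPU 4; seeing the same process name in later crash dumps would support the Icecast theory, though it would not by itself prove the cause.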

In the meantime I would like to offer to move your server to a different machine. We can move you to Newark or Fremont. Your hostname and port would change but everything else would remain the same (settings, reports, start page, logs, login...). Is this something you would like me to do while we get to the bottom of this most unusual issue?

Apologies once again.
 
Thanks for the offer, but this isn't really a practical solution for us, as that would mean changing our stream address everywhere we are listed - iTunes, TuneIn, and all the streaming radio directories. If there was a way to do this without having to change the URL in all those directories, we'd consider it. Otherwise, we'll take our chances on the current server, and if there are still issues when our season turns over in mid-November, we'll consider other options. Thanks.
 
Sure, that's understandable. Thanks for bearing with us. We're still unable to find the exact cause of this despite narrowing it down to probably Icecast2, so it's looking likely that we will build a new server from scratch and migrate everyone's us1 accounts across. Fortunately, the way we have the network set up now, we can easily rotate the IPs, ensuring hostnames / IPs / ports (actually everything apart from the underlying software) remain the same. Unfortunately there will be a little downtime while this occurs. We would of course give everyone plenty of notice and keep downtime to a minimum. We'll make an announcement once we have something more concrete.
 