Digital Services Paging System Now Available on GitHub

Nov 12, 2013

The Digital Services Paging System is now available as open source on GitHub.  You can use it to coordinate off-hours staff communications, emergency support, or countless other text-based messaging needs.  To understand how it evolved, let's step a bit back in time.

When we were a small tech company, rounding up our staff for off-hours emergencies was a pretty easy task.  We literally had an emergency@ds.npr.org email address that sent a text message to everyone's phone via the carriers' email-to-SMS gateways (xxxxxxxxxx@txt.att.net for AT&T, xxxxxxxxxx@vtext.com for Verizon, etc.).  And for a while that worked well.  But as our staff grew, along with the diversity of problems we wanted to be notified about, a number of issues became clear:

  • Technical:  If you send the same message to too many different people on the same phone carrier's network, the carrier's spam filter starts to drop your messages.  Some staff members randomly don't get the alert.
  • Social:  Our policy was for the first responder to always reply with some form of "I got it" so that not everyone would have to run back to their laptops at once.  But if staff want to reply to each other in real time (i.e. by text instead of email), then everyone's phone needs to be pre-programmed with everyone else's cell phone number.  That's an update for the entire tech department every time someone new is hired.
  • Social:  If the alert goes out at 3am and you're using a single email-to-SMS address like we were, your entire department gets woken up, every time, even if it only takes one person to fix the problem.  Let's call this the system's inability to target the correct staff member.
  • Social:  If you use Nagios, Zabbix or another network monitoring tool, you can get some degree of targeted alerting.  They can be configured to say "if server A goes down, page these three people.  If server B goes down, page these five people."  But this leads to another version of the 3am problem where you've been woken up, you want to reply with your "I got it!" so the other folks can go back to sleep, but you don't necessarily know which people the network monitor paged.  If you don't send your reply to all of them, you're duplicating effort with more groggy people still getting out of bed.  If you send your reply to more people than the ones who got the initial page, then you're waking up additional people.

For these reasons and several others, we decided to come up with a better paging solution.  It's one that has evolved with us over time.  And today we're releasing it as open source for anyone else who might be plagued with similar communication woes.

Enter DSPS (the Digital Services Paging System).  The quickest way to start learning about DSPS is by learning about how you interact with it.  On your phone DSPS is just another person in your Contact list with their own unique phone number.  All of your interactions with it are as text messages and they're self-contained as a "conversation with that person."  For example, on my phone I call the DSPS contact "NPR Paging."  That way no matter who pages me via DSPS, my phone will say I have a text from NPR Paging and I'll have some context.  DSPS automatically inserts the sender's name in the message if it's a person paging me, as opposed to Nagios.

Once you've configured the system with the names of your staff and your group names (collections of staff members), sending a message to someone is as easy as mentioning their name anywhere in the message:

  • "Hey Rick I need some help with this authentication problem."
  • "The caching server is down.  Can someone from dev jump on IM?"

Assuming you have a user named Rick, the first message will go to him.  Assuming you have a group named dev, the second message will go to everyone in that group.  If your staff is larger, names can be FirstLast, a network ID ("rennis") or anything else you can think of.  For very large groups, there's also an option that tells DSPS not to recognize names in messages unless they're preceded by an @ sign, like Twitter references.
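
To make the name matching concrete, here is a minimal Python sketch of the idea.  It is not DSPS's actual code (DSPS is a Perl daemon), and the user list, group list, and the find_recipients function are hypothetical names invented for illustration.  It assumes a name counts as a mention when it appears as a whole word, ignoring case.

```python
import re

# Hypothetical configuration, invented for this example.
# It does not reflect DSPS's real configuration format.
USERS = {"rick", "dana", "lee"}
GROUPS = {"dev": ["rick", "dana"], "ops": ["lee"]}


def find_recipients(message, require_at_sign=False):
    """Return the sorted user names addressed by a message.

    A user or group name counts as a mention when it appears as a
    whole word, case-insensitively. With require_at_sign=True, only
    names written as @name count.
    """
    pattern = r"@(\w+)" if require_at_sign else r"\b(\w+)\b"
    recipients = set()
    for word in re.findall(pattern, message.lower()):
        if word in USERS:
            recipients.add(word)
        elif word in GROUPS:
            # A group mention expands to every member of the group.
            recipients.update(GROUPS[word])
    return sorted(recipients)
```

With this sketch, the first example message above resolves to the single user rick, and the second resolves to every member of the assumed dev group.  Turning on require_at_sign keeps an ordinary word that happens to match a name from pulling anyone in.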

Once one or more people are pulled into a conversation this way, DSPS tracks the participants in the room (the "audience"), and any message from any of those individuals to DSPS is automatically copied to everyone else in the room, like a group chat.  Additional people or groups can be pulled in at any time.
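
The audience idea can be sketched in a few lines of Python.  Again, this is an illustration under assumptions rather than DSPS's implementation: the Room class and its method names are hypothetical, and actual delivery is left out, so relay only returns the (recipient, text) pairs that would be sent.

```python
class Room:
    """Tracks a conversation's audience and fans messages out to it."""

    def __init__(self):
        self.audience = set()

    def pull_in(self, names):
        """Add people to the conversation at any point."""
        self.audience.update(names)

    def relay(self, sender, text):
        """Return (recipient, text) pairs for everyone except the sender.

        The sender joins the audience, and the sender's name is
        prefixed to the text so recipients know who wrote it.
        """
        self.audience.add(sender)
        tagged = f"{sender}: {text}"
        return [(name, tagged) for name in sorted(self.audience) if name != sender]
```

For example, if dana starts a conversation that pulls in rick, a later "I got it" from rick goes back to dana only, and anyone added afterwards receives the messages that follow.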

From there the DSPS feature set grew and now includes the ability to:

  • Accept Nagios alerts and apply them to particular groups of users
  • Accept messages from email and apply them to particular groups of users
  • Define on-call schedules and send the alert to a different person depending on the week
  • Define automatic escalation policies, e.g. when an alert matching specific conditions comes in, send it to a particular person (or a rotating "on-call" person) and, if that person doesn't reply within a configured amount of time, re-send the alert to a larger group of people
  • Automatically throttle - if your network has a serious issue where Nagios or Zabbix starts bombarding you with texts, slow things down
  • Filter - Via text commands on your phone, have the system ignore Nagios-style pages of a particular type (all of Nagios, Load & Recovery pages for when nightly maintenance is complete, or a specific regex)
  • Request Tracker (RT3) integration - log particular types of conversations [such as ones kicked off by a customer support call] to an RT ticket
  • Broadcast Mode - Instead of functioning as a full group chat, make all replies only go to the initial starter of the conversation.  For example, you could use this as a real-time roll call in the case of a natural disaster or office emergency.  One person sends a message to an alias that copies the entire company.  Then each person's reply only goes back to the sender instead of having a massive full-company-chat.
  • Vacation - Users can send text commands to notify the system of their upcoming vacation, automatically dropping them out of group pages and on-call rotations for that time, without the admin having to reconfigure DSPS.
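
As a rough illustration of how the on-call schedule and escalation features in the list above could fit together, here is a small Python sketch.  Everything in it is an assumption made for the example: the rotation order, the ten-minute timeout, and the function names are hypothetical, and DSPS's own configuration and logic may look quite different.

```python
from datetime import date, datetime, timedelta

# Hypothetical rotation and policy values, chosen only for illustration.
ON_CALL_ROTATION = ["rick", "dana", "lee"]
ESCALATION_GROUP = ["rick", "dana", "lee"]
REPLY_TIMEOUT = timedelta(minutes=10)


def on_call_for(day):
    """Pick the on-call person by ISO week number, cycling through the rotation."""
    week = day.isocalendar()[1]
    return ON_CALL_ROTATION[week % len(ON_CALL_ROTATION)]


def recipients_for_alert(sent_at, now, replied):
    """Decide who an alert is addressed to at time `now`.

    The alert first goes to the on-call person. If nobody has replied
    once REPLY_TIMEOUT has elapsed, it is addressed to the larger group.
    """
    if replied:
        return []
    if now - sent_at >= REPLY_TIMEOUT:
        return list(ESCALATION_GROUP)
    return [on_call_for(sent_at.date())]
```

A caller would check recipients_for_alert periodically: right after the alert arrives it names only that week's on-call person, and once the timeout passes without a reply it names the whole escalation group.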

There's plenty more but this gives you a good taste of what DSPS entails.

The one other point worth mentioning is that DSPS itself (a Perl daemon) can't fix the Technical issue I mentioned at the start, where Verizon or AT&T drop messages because they think they're spam.  For that, I recommend an SMS Gateway provider, such as Signal HQ.  For a small fee they'll pass your SMS messages to cell phones and collect the replies without passing through an email spam filter.

As open source, DSPS isn't a product or service that Digital Services officially supports.  But if you have questions I'm happy to try to point you in the right direction.  You can reach me at dsps@ds.npr.org.

DSPS is available on GitHub at https://github.com/nprds/pagingd