14 June 2021
Through an extended investigation of the Website to Council communication issue, the following facts have been determined:
- The website stopped sending messages during a period in December
- It started working again during the last week of December (25th onwards)
- It stopped working again around the 4th of January until the start of March.
Checking through the logs and background files of the website has allowed us to locate and recover the missing messages. A total of 172 messages should have got through, of which 20 were individual service ticket requests. The remaining 152 were contact messages, which would have needed to be classified by topic.
But you said you had lost them, how did you get them back?
The Council did not lose the messages; the Council's servers never received them. By going back through the logs of the various 3rd party servers along the delivery path, we have managed to rebuild the list of messages and extract most of their contents.
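The recovery amounts to comparing what the upstream servers logged as sent with what the Council actually received. A minimal sketch of that comparison, assuming each message carries a unique identifier in both sets of logs (the identifiers, sample data, and the function name `find_missing` are illustrative and not taken from the actual systems):

```python
def find_missing(sent_ids, received_ids):
    """Return IDs logged as sent upstream but never received, in sent order."""
    received = set(received_ids)
    seen = set()
    missing = []
    for message_id in sent_ids:
        # Skip duplicates caused by retries appearing more than once in the log.
        if message_id not in received and message_id not in seen:
            missing.append(message_id)
            seen.add(message_id)
    return missing


# Hypothetical example data: IDs from gateway logs vs. the Council mailbox.
sent = ["msg-001", "msg-002", "msg-003", "msg-003", "msg-004"]
received = ["msg-001", "msg-004"]
print(find_missing(sent, received))
```

Here the two identifiers that appear only in the upstream log are reported as missing, with the retried one listed once.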
But why didn’t you do this when the issue was resolved?
At the time, it was believed that the 3rd party servers did not hold logs and information going back far enough to track down the missing messages. Subsequent discussions with the 3rd parties established that they did hold enough information to support a reasonable, affordable approach to recovery: extracting the messages from the servers and determining which had not actually been received, without increasing the cost to the Council (and hence the ratepayer).
What actually happened? Why did it stop working?
When a form is completed, the webserver sends an email to the Council for the contact team to process. This had been working smoothly and effectively for a few years. Once it was established that it had stopped working reliably, the investigation found the following:
The webserver needs to send the emails through a 3rd party mail gateway, due to the secure network structure that the website is based in. In December, it appears that a 3rd party SPAM/Phishing interception service, which the Council had been using to reduce the chance of an incoming email causing data issues within the Council, had its configuration changed to take account of the reputation of the servers forwarding emails on to the Council.
At the same time, the reputation of the 3rd party gateway had fallen because other 3rd parties were using it as a SPAM sending gateway. Apple, Google, and others have all had issues with this over time; iCloud in particular currently has problems getting its mail accepted by all servers across the internet.
What would normally happen at this point is that the email would be marked as SPAM and end up in a Junk mailbox, which the team also reviewed in case of a false positive result.
Unfortunately, what happened was that the 3rd party SPAM gateway was silently blocking these emails, so no alert was raised to say that the system had stopped them from coming in. In email technical terms, it was not always raising a 550 error but was instead accepting the email and then silently dropping it.
It was also determined that the 3rd party gateway was not flagging that its attempt to send an email had been blocked when it received a 550 error. It would just try again later with a different message, once it judged that a suitable period of time had passed. This explains why it started working again during the Christmas period.
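The difference between a visible rejection and a silent drop can be illustrated with SMTP reply codes. The sketch below is illustrative only and is not the gateway's actual logic: a 5xx reply such as 550 is a permanent rejection that the sender is told about, a 4xx reply is a temporary failure, and a 2xx reply means the receiving server accepted the message. The function name `classify_reply` is hypothetical.

```python
def classify_reply(code: int) -> str:
    """Classify an SMTP reply code from the sending server's point of view."""
    if 200 <= code < 300:
        # The receiver accepted the message. If it later discards it,
        # the sender has no way of knowing: this is the silent-drop case.
        return "accepted"
    if 400 <= code < 500:
        return "temporary failure"
    if 500 <= code < 600:
        # Includes 550: the sender is told and could raise an alert.
        return "permanent rejection"
    return "other"


for code in (250, 451, 550):
    print(code, classify_reply(code))
```

From the sender's side, a message that is accepted and then discarded is indistinguishable from one that was delivered, which is why monitoring based only on send errors cannot see it.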
It was a technical failure, and the combination of issues exposed a weakness in the monitoring that should have indicated that the process had stopped working.
The main failures were:
All contact messages come into the same shared mail environment for the initial contact team to process. The volume of messages in this shared mailbox environment disguised the fact that a source accounting for less than 10% of the messages was no longer sending.
The monitoring solution depended on detecting failures to send emails. It did not handle the situation where an email had been sent but not delivered and the failure to deliver it did not generate an email notification.
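One way to catch this kind of failure is to watch each source for unexpected silence rather than for send errors. A minimal sketch, assuming the time of the last received message is tracked per source and each source has an expected maximum gap (the source names, dates, thresholds, and the function name `silent_sources` are all hypothetical):

```python
from datetime import datetime, timedelta


def silent_sources(last_seen, max_gap, now):
    """Return, sorted by name, the sources with no message within their allowed gap."""
    overdue = []
    for source, allowed in max_gap.items():
        seen = last_seen.get(source)
        # A source that has never been seen is treated as silent.
        if seen is None or now - seen > allowed:
            overdue.append(source)
    return sorted(overdue)


# Hypothetical example data, not taken from the Council's systems.
now = datetime(2021, 1, 10, 9, 0)
last_seen = {
    "website-forms": datetime(2021, 1, 4, 16, 30),
    "direct-email": datetime(2021, 1, 10, 8, 45),
}
max_gap = {
    "website-forms": timedelta(days=2),
    "direct-email": timedelta(hours=4),
}
print(silent_sources(last_seen, max_gap, now))
```

Because each source is checked against its own threshold, a low-volume source going quiet is flagged even when the shared mailbox as a whole remains busy.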
General Process Background for incoming messages:
The contact messages are all sent to a shared inbox environment for the team to process. The team either answers the message, assigns it to a relevant team if the initial contact team cannot answer it, or assigns it to the service desk queue so that a Council Service response can be allocated to the issue. During the period that the website was not sending messages, the contact team processed 1500 contact messages that came directly into the shared mailbox from other email sources.
So what have you done about this?
Since March, the routing and direction of the email from the website (and, to be fair, the rest of the emails coming into the Council) has been updated to go through a different third party solution for the filtering of messages, with a suitable level of risk determination set before an email is dropped or blocked. This server also has an increased level of reporting on incoming emails at total Council services level.
Continued and increased training in the handling of suspect emails, because ensuring that contact messages get through means we have to accept a level of risk with respect to incoming emails.
Enhanced backend monitoring at the Website and other 3rd party email transport services, with linked reports reviewed to attempt to pick up when there has been a silent failure of email delivery.
Initial contact processes are under continual review, with changes to the approach and structure of the contact team to enable them to flag up a lack of messages coming in from different sources.
Investigation of more direct methods of messaging Council services, with technical hooks to provide better delivery SLAs than email provides, subject to passing a cost/benefit analysis, as again this is a direct cost to the ratepayer.