In a hearing before the US Senate Commerce Committee, the Federal Aviation Administration’s (FAA’s) acting administrator Billy Nolen said that the agency has taken steps to help prevent a repeat of the January 11 Notice to Air Missions (NOTAM) system outage. Nolen told the panel, “After the incident, we implemented a synchronization delay to ensure that bad data from a database cannot affect a backup database. Additionally, we have implemented a new protocol that requires more than one individual to be present and engaged in oversight when work on the database occurs.”
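The FAA has not published the NOTAM system's internals, so as a rough illustration of what a synchronization delay buys you, here is a minimal sketch in Python of a delayed-apply replica: changes to the primary sit in a hold queue and reach the backup only after a validation window, so corrupt data can be quarantined instead of propagated. The hold period, the is_valid check, and the apply_to_backup callback are all hypothetical.

```python
import time
from collections import deque

HOLD_SECONDS = 300  # hypothetical delay before a primary change reaches the backup

pending = deque()   # (enqueue time, record) pairs awaiting replication

def is_valid(record: dict) -> bool:
    """Hypothetical integrity check; a real system would verify schema,
    checksums, and cross-field consistency before trusting the record."""
    return bool(record.get("id")) and "payload" in record

def enqueue_change(record: dict) -> None:
    """Call this after a change has been applied to the primary database."""
    pending.append((time.monotonic(), record))

def replicate_due_changes(apply_to_backup) -> None:
    """Push only changes older than HOLD_SECONDS to the backup, and only
    if they still validate; bad data is quarantined, not replicated."""
    now = time.monotonic()
    while pending and now - pending[0][0] >= HOLD_SECONDS:
        _, record = pending.popleft()
        if is_valid(record):
            apply_to_backup(record)
        else:
            print(f"quarantined corrupt record: {record!r}")  # alert instead
```

The point of the delay is simply to leave a window in which corruption can be caught before the backup inherits it; mature replication stacks offer this natively, for example PostgreSQL's recovery_min_apply_delay setting, which keeps a standby deliberately behind its primary for exactly this reason.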
Note
- Let’s be clear: the issue was not the contractors; it was the FAA’s failure to recognize that pilot access to NOTAM information is a go/no-go condition for flight, and that any change to those data files was a potentially disastrous event. File Integrity Monitoring (FIM) tools and processes have been around for a long, long time, but they are often not applied to the files and executables that most need to be resilient. (A minimal sketch of the technique follows this list.)
- Having a two-person rule for changes to critical systems is a pretty good idea. Presumably, they are also testing changes in a production-clone environment, more than once. At a minimum, make sure you’ve walked through the rollback process and know how long it will take and how it’s triggered. If you’ve never rolled back a production change, particularly one that requires a database restore, you need to experience that to understand what’s involved. Maybe more than once. (Don’t ask how I know this.) See the timed drill sketch after this list.
- It still comes down to people, process, and technology. People: were the contractors, and FAA personnel for that matter, sufficiently trained to understand and maintain a critical flight system used by the aviation industry? Process: was the update sufficiently QA’ed and tested before it was applied to both the active and backup NOTAM systems? Technology: let’s not forget, the NOTAM system is 30 years old, and it can be difficult to find qualified engineers to maintain it. Which brings us back to people.
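On the first note’s point about file integrity: as a minimal sketch of the technique, assuming the critical files all live under one directory, the fragment below hashes everything into a baseline and later reports any drift. Production tools such as Tripwire or AIDE do this with tamper-resistant storage and alerting; the fim_baseline.json path here is hypothetical.

```python
import hashlib
import json
from pathlib import Path

BASELINE = Path("fim_baseline.json")  # hypothetical baseline location

def sha256(path: Path) -> str:
    """Hash a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(root: Path) -> dict:
    """Hash every file under root to build (or rebuild) a trusted baseline."""
    return {str(p): sha256(p) for p in root.rglob("*") if p.is_file()}

def save_baseline(root: Path) -> None:
    """Record the known-good state; run this only from a trusted build."""
    BASELINE.write_text(json.dumps(snapshot(root), indent=2))

def check(root: Path) -> list[str]:
    """Report files that changed, appeared, or vanished since the baseline."""
    old = json.loads(BASELINE.read_text())
    new = snapshot(root)
    report = [f"CHANGED {p}" for p in old if p in new and old[p] != new[p]]
    report += [f"ADDED   {p}" for p in new if p not in old]
    report += [f"REMOVED {p}" for p in old if p not in new]
    return report
```

Run save_baseline once from a known-good state, then run check on a schedule; any output at all on a file that should never change is an alert, not a log line.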
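On the second note’s advice to rehearse rollback: the useful output of a drill is a measured duration, not a gut feeling. A minimal sketch of timing one, where the pg_restore command line is purely a stand-in for whatever your real, documented restore procedure is:

```python
import subprocess
import time

# Hypothetical restore command; substitute your actual documented procedure.
RESTORE_CMD = ["pg_restore", "--clean", "--dbname=notam_clone", "backup.dump"]

def rollback_drill() -> float:
    """Run the documented rollback against a production clone and time it.
    The elapsed time becomes your evidence-based recovery estimate."""
    start = time.monotonic()
    result = subprocess.run(RESTORE_CMD, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    if result.returncode != 0:
        raise RuntimeError(f"rollback failed after {elapsed:.0f}s: {result.stderr}")
    print(f"rollback completed in {elapsed:.0f}s")
    return elapsed
```

Run it against a production clone, record the number, and you know your real recovery window before you ever need it.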
Read more in