Carlos Leyva

Silicon Stories

Chapter 5: The Missing Factory

Incident Management


What is an incident? An “incident” can be a bug that has to be fixed, a feature that has to be implemented, a customer question that has to be answered, or any other important item that has to be tracked until it is dealt with. Associated with each incident is a general description of the issue, a priority, a person that it is assigned to, and a status. In addition, an incident may be associated with one or more contacts (e.g. customers) or files (e.g. documentation or source code), and finally, an incident is associated with a specific project.
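The attributes listed above map naturally onto a small record type. The following is only a minimal sketch in Python of how such a record might look; the class names, the priority and status values, and the `is_open` helper are assumptions made for illustration, not the schema of any particular incident management product.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Priority(Enum):
    # Hypothetical priority levels; real tools define their own scales.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class Status(Enum):
    # Hypothetical lifecycle states for illustration only.
    OPEN = "open"
    IN_PROGRESS = "in progress"
    CLOSED = "closed"


@dataclass
class Incident:
    description: str          # general description of the issue
    priority: Priority
    assigned_to: str          # the person responsible for the incident
    project: str              # every incident belongs to one project
    status: Status = Status.OPEN
    contacts: List[str] = field(default_factory=list)  # e.g. customers
    files: List[str] = field(default_factory=list)     # e.g. docs or source

    def is_open(self) -> bool:
        """An incident is tracked until it has been dealt with."""
        return self.status is not Status.CLOSED


bug = Incident(
    description="Login page crashes on empty password",
    priority=Priority.HIGH,
    assigned_to="alice",
    project="website",
)
bug.contacts.append("customer@example.com")
bug.files.append("src/login.c")
```

Keeping contacts and files as optional lists while requiring a project mirrors the description above: an incident may have zero or more of the former, but always exactly one of the latter.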

In the Microsoft universe, one product, Visual Intercept by Elsinore Technologies, is best in class and so tightly integrated with the Visual Studio IDE that it is indistinguishable from one of Microsoft’s own offerings. An incident management product is the “glue” that holds the development process together, and it is an absolute requirement for all but the most trivial of projects.

Effective management of source code and incidents, using world-class tools and processes, makes up 80% of the critical, must-have factory components! Homegrown alternatives should no longer be tolerated as acceptable replacements.

Backup and Recovery

Backup and recovery is a set of strategies that determines how quickly an organization can respond after a disaster strikes. A disaster can be anything from a hard disk failure on your one and only mission-critical web server to an act of God (e.g. a tornado, hurricane, or flood) that brings the site down. In order to minimize downtime, many factors need to be considered:

Usually what drives the decision is cost versus the amount of downtime (otherwise known as pain) the organization can tolerate. It is my experience that recoverability can be improved dramatically, relative to your current state, simply by focusing on low-cost but effective best practices. These best practices tend to focus on process issues that require minimal incremental capital expenditure. Providing absolute recoverability adds complexity and cost, and it is usually not warranted for work performed by development groups. However, if you are responsible for Etrade’s or Amazon’s production site, then that is a completely different story.

Here is a list of common-sense practices:

Once you have a working process, destroy shit on purpose from time to time to ensure that it is actually working. Obviously this should be done in a controlled manner, but it must be done. The most elegant backup and recovery procedures can develop insidious bugs that go undetected for long periods of time. The only way to prevent this from happening to you is to test it often.
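One controlled way to run such a test is a restore drill: simulate total loss by restoring the backup into an empty scratch location, then compare what comes back against the live data. The sketch below is only an illustration under stated assumptions; it assumes the backup is a plain directory copy, and the function names `checksums` and `restore_drill` are hypothetical. A real drill would invoke whatever restore procedure your backup tool actually provides.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_drill(live: Path, backup: Path) -> bool:
    """Restore `backup` into an empty scratch directory and verify the result.

    Restoring into an empty location stands in for the disaster, so the
    live data is never touched. Returns True only if every restored file
    matches the live copy byte for byte and no file is missing or extra.
    """
    expected = checksums(live)
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restored"
        # Assumption: restoring is a plain directory copy of the backup.
        shutil.copytree(backup, restored)
        return checksums(restored) == expected
```

Run on a schedule, a check like this catches the silent failure the paragraph above warns about, such as a backup that has been quietly writing corrupt or incomplete files. It only compares against the current live data, so it should run right after a backup completes, before the live files change again.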
