The Phoenix project — Book Notes

book notes - programming process - management

January 19, 2019 • ☕️☕️ 9 min read

The Phoenix project book cover — Showing all the main story characters — can you guess who is who?

The Phoenix Project tells a “fictional” but realistic story of the company “Parts Unlimited”, which has many conflicts among its departments: IT, development, marketing, and sales.

Bill is suddenly promoted to head of IT operations after his two top managers leave to pursue other options (a.k.a. fired).

His first assignment is to deliver “The Phoenix” project, an online e-commerce service that lets customers order car parts online and have them delivered. The company has high hopes for the project, which has been in the making for a long time, and there is a threat that the company’s board will split up the company and outsource the development and IT departments if the project fails.

The story follows Bill as he tries to streamline things while going from one outage to another. Then the board invites some veteran old guy named Erik to mentor them and help them manage their IT department.

Throughout the book, Erik takes Bill through the “Three Ways” of managing operations, which we will explain later in this article. But the first thing Erik helped Bill discover was the “Four Types of Work”, because without understanding which types of work you are dealing with, you cannot optimize them.

#The Four Types of Work:

##1- Business projects:

The projects requested by the business to deliver direct value to customers and/or move a specific business metric/KPI.

##2- Internal IT projects:

Infrastructure or internally generated improvement projects. Often these are not centrally tracked anywhere outside the IT operations department, which creates a problem: there is no easy way to find out how much capacity is already committed to them.

##3- Changes:

Generated from the previous two types of work and typically tracked in a ticketing system.

A ‘change’ is any activity that is physical, logical, or virtual to applications, databases, operating systems, networks, or hardware that could impact services being delivered.

There is a side story from the book that I think is worth mentioning here. When Bill joined the team, there was already a system where people were supposed to log their changes, but no one used it because it was so complicated, with very long forms. It didn’t make sense to spend 15 minutes filling in a form for a 10-minute task (at least that was people’s logic).

But not logging the changes caused chaos in the department, because no one could see the big picture any more: who was doing what, which team was buried under 100 tasks, and which one had capacity for more projects. That is most dangerous when something goes wrong and no one has a clue which change caused the failure or outage. So Bill changed the team leads’ mentality from “people are lazy and not using the change system” to “the target is not to make developers use the change system but to keep track of the changes”. If the system is too complicated, let’s use a simpler way that serves the same purpose. That way was just sticky notes, each one describing a change, who is working on it, and for how long (the birth of Trello!).

##4- Unplanned Work:

This type is firefighting: problems, outages, and other unexpected work that causes the planned work to be delayed.

“Unlike the other categories of work, unplanned work is recovery work, which almost always takes you away from your goals. That’s why it’s so important to know where your unplanned work is coming from.”

##Work In Process (WIP):

At one point in the story, Erik, the mentor, took Bill to one of the company’s plants to explain the flow of work and WIP.

These are Erik’s words as he explains to Bill what WIP is and why it matters:

“Look down there,” he says. “You can see loading docks on each side of the building. Raw materials are brought in on this side, and the finished goods leave out the other. Orders come off that printer down there.

If you stand here long enough, you can actually see all the WIP, that’s ‘work in process’ or ‘inventory’ for plant newbies, make its way toward the other side of the plant floor, where it’s shipped to customers as finished goods. For decades at this plant, there were piles of inventory everywhere. In many places, it was piled as high as you could stack them using those big forklifts over there. On some days, you couldn’t even see the other side of the building. In hindsight, we now know that WIP is one of the root causes for chronic due-date problems, quality issues, and expediters having to re-juggle priorities every day. It’s amazing that this business didn’t go under as a result. In the 1980s, this plant was the beneficiary of three incredible scientifically-grounded management movements. You’ve probably heard of them: the Theory of Constraints, Lean production or the Toyota Production System, and Total Quality Management. Although each movement started in different places, they all agree on one thing: WIP is the silent killer. Therefore, one of the most critical mechanisms in the management of any plant is job and materials release. Without it, you can’t control WIP.”

##Kanban board

To create a fast flow of work through Development and IT Operations, index cards on a kanban board are one of the best mechanisms, because everyone can see WIP. It is one of the primary ways manufacturing plants schedule and pull work through the system. You can take the most frequent service requests, document exactly what the steps are and which resources can execute them, and time how long each operation takes.
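A board like this is simple enough to sketch in a few lines of Python. The class below is only an illustration: the names `KanbanBoard`, `pull`, and `finish` are made up, and the explicit `wip_limit` is an assumption borrowed from common kanban practice rather than something spelled out in these notes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class KanbanBoard:
    """A three-column board: ready -> doing -> done."""

    wip_limit: int  # assumed cap on cards allowed in "doing" at once
    ready: list[str] = field(default_factory=list)
    doing: list[str] = field(default_factory=list)
    done: list[str] = field(default_factory=list)

    @property
    def wip(self) -> int:
        """Work in process: everything started but not yet finished."""
        return len(self.doing)

    def pull(self) -> bool:
        """Pull the next ready card into "doing" if there is capacity."""
        if not self.ready or self.wip >= self.wip_limit:
            return False
        self.doing.append(self.ready.pop(0))
        return True

    def finish(self, card: str) -> None:
        """Move a card from "doing" to "done", freeing capacity."""
        self.doing.remove(card)
        self.done.append(card)


if __name__ == "__main__":
    # Hypothetical service requests, used only as card labels.
    board = KanbanBoard(wip_limit=2, ready=["add user", "reset password", "new VM"])
    board.pull()
    board.pull()
    print("third pull allowed?", board.pull())
    board.finish("add user")
    print("pull after finishing one?", board.pull())
    print("doing:", board.doing, "| done:", board.done)
```

The point of the sketch is that new work is pulled only when there is room for it, so WIP stays visible and bounded instead of piling up.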

##Preventive Work

“Properly elevating preventive work is at the heart of programs like Total Productive Maintenance. TPM insists that we do whatever it takes to assure machine availability by elevating maintenance…. Improving daily work is even more important than doing daily work…. it almost doesn’t matter what you improve, as long as you’re improving something. Why? Because if you are not improving, entropy guarantees that you are actually getting worse, which ensures that there is no path to zero errors, zero work-related accidents, and zero loss.”

##The Constraint:

In most plants, there are a very small number of resources, whether it’s men, machines, or materials, that dictate the output of the entire system. This is called the constraint, or bottleneck.

Until you create a trusted system to manage the flow of work to the constraint, the constraint is constantly wasted, which means that the constraint is likely being drastically underutilized.

That means you’re not delivering to the business the full capacity available to you. It also likely means that you’re not paying down technical debt, so your problems and amount of unplanned work continue to increase over time.

##The three steps to eradicate the constraint:

1- Identify the constraint. Any improvement not made at the constraint is just an illusion.

2- Exploit the constraint. Make sure that the constraint is not allowed to waste any time. Ever.

It should never be waiting on any other resource for anything, and it should always be working on the highest priority commitment the IT Operations organization has made to the rest of the enterprise. Always.

3- Subordinate the constraint, that is, subordinate every other resource to the pace of the constraint.

“In the Theory of Constraints, this is typically implemented by something called Drum-Buffer-Rope. In The Goal, the main character Alex learns about this when he discovers that Herbie, the slowest Boy Scout in the troop, actually dictates the entire group’s marching pace. Alex moved Herbie to the front of the line, to prevent kids from going on too far ahead. Later at Alex’s plant, he started to release all work in accordance to the rate it could be consumed by the heat treat ovens, which was his plant’s bottleneck. That was his real-life Herbie.”
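The three steps can be illustrated with a toy model. This is a minimal sketch under strong assumptions: work flows through the work centers strictly in series, each center has a fixed weekly capacity, and the center names and numbers are invented. Step 2 (exploiting the constraint) is about scheduling and is not modeled here.

```python
from __future__ import annotations


def identify_constraint(capacity: dict[str, float]) -> str:
    """Step 1: the constraint is the work center with the lowest capacity."""
    return min(capacity, key=lambda name: capacity[name])


def system_throughput(capacity: dict[str, float]) -> float:
    """A serial flow can never finish more work than its slowest step."""
    return min(capacity.values())


def release_rate(capacity: dict[str, float]) -> float:
    """Step 3 (simplified): release new work only as fast as the
    constraint can consume it, so WIP does not pile up in front of it."""
    return capacity[identify_constraint(capacity)]


# Hypothetical capacities in tasks per week; names and numbers are made up.
work_centers = {"development": 10.0, "qa": 8.0, "ops": 3.0, "release": 6.0}

if __name__ == "__main__":
    print("constraint:", identify_constraint(work_centers))
    print("throughput:", system_throughput(work_centers))

    # Doubling a non-constraint changes nothing ("just an illusion").
    faster_dev = {**work_centers, "development": 20.0}
    print("after speeding up development:", system_throughput(faster_dev))

    # Improving the constraint itself is what raises throughput.
    faster_ops = {**work_centers, "ops": 5.0}
    print("after speeding up ops:", system_throughput(faster_ops))
```

In this model, releasing work faster than `release_rate` adds nothing to output; it only adds WIP waiting in front of the constraint, which is the Herbie lesson from the quote.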

##Takt time

“In manufacturing, we have a measure called takt time, which is the cycle time needed in order to keep up with customer demand. If any operation in the flow of work takes longer than the takt time, you will not be able to keep up with customer demand.

How Toyota solved this problem is legendary. During the 1950s, they had a hood stamping process that had a change-over time of almost three days. It required moving huge, heavy dies that weighed many tons. Like us, the setup times were so long that they needed to use large batch sizes, which prevented them from using one stamping machine to manufacture multiple different car models simultaneously. You can’t make one hood for a Prius and then one hood for a Camry if it takes you three days to do the changeovers, right? What did they do? They closely observed all the steps required to do the changeover, and then put in a series of preparations and improvements that brought the changeover time down to under ten minutes.”
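Takt time is usually computed as the available working time divided by the number of units customers demand in that time. The sketch below uses that common definition; the function names and all the numbers are hypothetical and only there to show the comparison the quote describes.

```python
from __future__ import annotations


def takt_time(available_minutes: float, units_demanded: int) -> float:
    """Minutes available per unit if we are to keep up with demand."""
    if units_demanded <= 0:
        raise ValueError("units_demanded must be positive")
    return available_minutes / units_demanded


def operations_over_takt(cycle_minutes: dict[str, float], takt: float) -> list[str]:
    """Names of operations whose cycle time exceeds the takt time."""
    return [name for name, minutes in cycle_minutes.items() if minutes > takt]


# Hypothetical numbers: one 480-minute shift, customers want 240 units per shift.
takt = takt_time(available_minutes=480, units_demanded=240)

# Hypothetical cycle times, in minutes per unit, for each operation.
cycle_minutes = {"stamping": 1.5, "welding": 2.5, "painting": 2.0}

if __name__ == "__main__":
    print(f"takt time: {takt} minutes per unit")
    print("operations that cannot keep up:", operations_over_takt(cycle_minutes, takt))
```

With these made-up numbers, only welding takes longer than the takt time, so welding is where the flow falls behind demand.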

##Ten deploys a day:

“Allspaw and Hammond ran the IT Operations and Engineering groups at Flickr. Instead of fighting like cats and dogs, they talked about how they were working together to routinely do ten deploys a day! This is in a world when most IT organizations were mostly doing quarterly or annual deployments.”

“Allspaw taught us that Dev and Ops working together, along with QA and the business, are a super-tribe that can achieve amazing things. They also knew that until code is in production, no value is actually being generated, because it’s merely WIP stuck in the system. He kept reducing the batch size, enabling fast feature flow. In part, he did this by ensuring environments were always available when they were needed. He automated the build and deployment process, recognizing that infrastructure could be treated as code, just like the application that Development ships. That enabled him to create a one-step environment creation and deploy procedure.”

“you need to create what Humble and Farley called a deployment pipeline. That’s your entire value stream from code check-in to production. That’s not an art. That’s production. You need to get everything in version control. Everything. Not just the code, but everything required to build the environment. Then you need to automate the entire environment creation process.

You need a deployment pipeline where you can create test and production environments, and then deploy code into them, entirely on-demand. That’s how you reduce your setup times and eliminate errors, so you can finally match whatever rate of change Development sets the tempo at.”

##The wait time:

“The wait time for a given resource is the percentage that resource is busy, divided by the percentage that resource is idle. So, if a resource is fifty percent utilized, the wait time is 50/50, or 1 unit. If the resource is ninety percent utilized, the wait time is 90/10, or nine times longer.”

Erik used this to show why Brent’s simple thirty-minute changes were taking weeks to get completed. The reason, of course, is that as the bottleneck of all work, Brent is constantly at or above one hundred percent utilization, and therefore, anytime we required work from him, the work just languished in queue, never worked on without expediting or escalating.

In the book’s graph of this relationship, the x-axis represents the percent of time a given resource is busy at a work center, and the y-axis is the approximate wait time (or maybe more precisely stated, the queue length). What the shape of the line shows is that, as resource utilization goes past eighty percent, wait time goes through the roof.
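The formula from the quote is easy to tabulate. Here is a minimal sketch; the function name `wait_time` is made up, and the result is in the same relative “units” the quote uses, not in hours or days.

```python
def wait_time(utilization: float) -> float:
    """Relative wait time = fraction busy / fraction idle.

    `utilization` is a fraction between 0 and 1. At exactly 1.0 the
    resource is never idle, so the wait is treated as infinite.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if utilization == 1.0:
        return float("inf")
    return utilization / (1.0 - utilization)


if __name__ == "__main__":
    for pct in (50, 80, 90, 95, 99):
        print(f"{pct}% busy -> {wait_time(pct / 100):.1f} units of wait")
```

At fifty percent busy this gives 1 unit and at ninety percent it gives 9, matching the numbers in the quote. The jump between ninety and ninety-nine percent is the “through the roof” part of the curve.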

#The Three Ways:

##The First Way

“The First Way helps us understand how to create fast flow of work as it moves from Development into IT Operations, because that’s what’s between the business and the customer.”

It is about the left-to-right flow of work from Development to IT Operations to the customer. In order to maximize flow, we need small batch sizes and intervals of work, never passing defects to downstream work centers, and to constantly optimize for the global goals (as opposed to local goals such as Dev feature completion rates, Test find/fix ratios, or Ops availability measures).

“The need to continually reduce cycle times is a crucial part of the First Way.”

##The Second Way

“The Second Way shows us how to shorten and amplify feedback loops, so we can fix quality at the source and avoid rework.”

“Creating constant feedback loops from IT Operations back into Development, designing quality into the product at the earliest stages. To do that, you can’t have nine-month-long releases. You need much faster feedback.

You’ll never hit the target you’re aiming at if you can fire the cannon only once every nine months. Stop thinking about Civil War era cannons. Think antiaircraft guns.”

##The Third Way

It is about creating a culture that fosters two things: continual experimentation, which requires taking risks and learning from success and failure, and understanding that repetition and practice is the prerequisite to mastery.

The notes above are just a summary of the book for myself. Even if you read them, I think you should still read the full book. It is a very enjoyable story, and I highly recommend the audiobook (the narrator made it even better!).