Editor’s note: This interview with Lee Atchison was recorded for Coding Over Cocktails, a podcast by Lonti (previously known as Toro Cloud).
From what was originally planned as a bookstore in 1994, Amazon has become the world’s largest online retailer by market cap today. In the mid-2000s, the platform that powers amazon.com had to be rearchitected to support the growth of the online store.
Lee Atchison, Application Modernization expert and author of “Architecting for Scale”, was directly involved with the project to migrate amazon.com from a monolithic architecture to a service-oriented architecture (SOA).
In an episode of Coding Over Cocktails, Atchison recalls his time working at Amazon and talks about what businesses can learn from Amazon’s migration and modernization of its systems.
Lee Atchison served as Amazon’s Senior Technical Manager from 2005 to 2006 and led the company’s retail website migration from a monolithic to a service-oriented architecture (SOA).
Although Amazon was a relatively small company at the time compared to retailers like Walmart, Atchison says it was already encountering technical problems, specifically with its system design.
“The biggest problem they were running into is that they had designed the system so that every transaction that went through the website had to go through this one application called Obidos.”
The problem with Obidos, Atchison says, was that it was built in a way where they “funnel everything into one location”. Everyone had assumed applications would be easier to manage when built that way.
“About a hundred engineers were working on, essentially, some way, shape or form, this one piece of code. And there were other services, but this one piece of code was touched a lot by everybody. They tried deploying it twice a week, but there was always something [that went wrong].”
In the end, the code was rarely deployed, which slowed down application development and made changes more difficult.
Seeing that the Obidos platform was no longer efficient, the team came to realize that treating each feature and service as an independent piece gave them more flexibility and made it easier to ship updates without breaking the whole architecture.
The migration from Obidos to a platform based on a SOA was a project called Gurupa.
“The whole idea [with Gurupa] was, ‘Let's build this application and instead of having one funnel point, we instead have an infrastructure where we can plug in all these modules that are all independently developed, independently deployed, independently tested and you know, services, essentially, frontend services as well as backend services. And we'll change the entire website over to this new model and throw away the old model,’” Atchison explains.
Gurupa was Amazon’s distributed system at the time and migration began in January 2005. Atchison says that the migration went on for about two years.
“By the time it was over, I was running the team that was doing the coordination to get all of that activity done and moving over to the new architecture.”
Atchison says that, at one point, most of the changes were made within a single 24-hour period.
“We had metrics on the wall and phones back to other teams - and yes, we had real phones - and we called back to real people on other teams. We were all in this room together and doing this migration, country by country over the course of 24 hours. All over one night. One whole day.”
Atchison estimates that about 0.1% of internet traffic changed that day as they migrated towards the new architecture. He explained that, from an outsider’s perspective, the change was hardly noticeable and the migration was executed smoothly. As such, they were able to avoid what they called a “New York Times” event.
“The whole idea is that ‘New York Times’ events were bad things. That’s when there was an article posted in the New York Times that said ‘Amazon screwed up again’ and we tried to avoid those. Our whole goal was to try and make it through the entire day without generating a New York Times event.”
Atchison’s time at Amazon helped develop his passion for SOA. He believes that a SOA plays an important role in building applications that are highly scalable.
“I learned so much about service-based architectures, about scaling, what high-availability really is about and why it's important to maintain availability and what happens when you don't.”
He adds, however, that to ensure the scalability and availability of applications you have to look beyond SOA to the structure of an organization itself.
“Probably the biggest, single thing that's involved in high-availability and high-scaling besides just architecting in a service-oriented way is architecting your organization in the correct way.”
In his book, Atchison introduces the concept of STOSA, which stands for Single Team Oriented Service Architecture. STOSA is about designing development teams, and the organization itself, in such a way that they are able to build scalable and highly available applications.
“Building service-oriented architecture is one part of the problem, assigning ownership and defining what ownership means is another part of it. Service level agreements and inter-service service level agreements are incredibly important.”
“I think STOSA is really about the best practices and the methods for how the different teams within the organization interact in order to make that happen.”
Besides ensuring availability, he adds that scaling also involves good risk management techniques in the organization.
“The main thing we're talking about with risk is to understand and plan for risk in advance, so you know what's going on and you can make plans for it.”
Atchison recommends building a risk matrix, which helps teams define both their known and unknown risks. Teams must then assign a severity and a priority to each risk. He explains that organizing and categorizing these risks is an important part of planning for new functionality later on.
“That's essentially your documentation of your technical data, of your risk plan. Any team that has a risk matrix that's empty that says, ‘Well, we were really, really good. We've got all the problems solved. There's no risk here at all.’ First of all, they're lying, or at least, they don't understand what's going on, but that means they haven't thought hard enough and they need to spend more time to come up with it.”
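As a rough illustration of what such a risk matrix might look like in practice, here is a minimal Python sketch. It is not taken from Atchison's book or the interview: the `Risk` fields, the `Level` scale, the `review` helper and the rule of ranking by likelihood multiplied by severity are all assumptions made for this example, as are the two sample risks.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both likelihood and severity."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Risk:
    description: str
    likelihood: Level
    severity: Level
    plan: str = ""  # an empty string means no mitigation plan yet

    @property
    def priority(self) -> int:
        # Assumed ranking rule for this sketch: likelihood times severity.
        return self.likelihood * self.severity


def review(matrix: list[Risk]) -> list[str]:
    """Return one summary line per risk, highest priority first."""
    if not matrix:
        # An empty matrix is treated as a warning sign, not a success.
        return ["Risk matrix is empty: spend more time identifying risks."]
    ranked = sorted(matrix, key=lambda r: r.priority, reverse=True)
    return [
        f"[priority {r.priority}] {r.description}: " + (r.plan or "NO PLAN YET")
        for r in ranked
    ]


# Made-up entries, purely for illustration.
matrix = [
    Risk("Single database becomes a bottleneck as traffic grows",
         Level.MEDIUM, Level.HIGH, "Add read replicas"),
    Risk("Third-party payment API outage", Level.LOW, Level.HIGH),
]

for line in review(matrix):
    print(line)
```

The sketch flags an empty matrix rather than treating it as good news, and it surfaces any risk that has no plan attached, which mirrors the point that the risks to avoid are the ones nobody has planned for.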
The goal, he explains, is not to completely wipe out the possibility of having risks in the application but to have as few unknown risks as possible.
“Risk is a natural part of the development process of an application, a natural process of everything. It's just the unknown risks or risks that you don't have a plan for that you want to avoid.”
“So, by recognizing the risks, seeing it, organizing it, prioritizing it and then planning for it, you can be prepared when problems occur. That'll improve availability and that'll improve scalability because a lot of risks fire as you scale up.”
Listen to more of our discussion on building applications that scale with Lee Atchison in this episode of Coding Over Cocktails, a podcast by Lonti (previously known as Toro Cloud).