Lessons from the private cloud
Ahmed Mahmoud, chief information officer of AMD, which has run a private cloud for about 10 years, talks about his experiences with data movement, latency, licensing and security issues.
Everybody seems to be talking about the cloud these days. The computing model provides an on-demand resource for network access that allows users to tap into a shared pool of configurable computing resources, such as applications, networks, servers, storage and services that can be rapidly provisioned. Advanced Micro Devices, a top chip manufacturer, has been in the clouds for years, offering its 10,000 engineers on-demand infrastructure and software services. GCN spoke recently with Ahmed Mahmoud, chief information officer of AMD, about lessons learned that agencies might find useful. — Rutrell Yasin
GCN: Why did AMD turn to cloud computing?
Mahmoud: One of the unique things about our situation is that we are a high-tech company that has a huge demand for compute resources. So we have a huge need for a high-performance computing environment.
Also, when you start looking at your data centers and environment differently, I would say five or 10 years ago most [chief information officers] did not know how much they spent on electricity. Now we worry about the footprint we have, the amount of power coming in. How do I maximize compute capabilities with a certain set of power, a certain amount of square footage? What we are discovering, sometimes, is that square footage is not the issue, but instead how much power you bring in and how much heat you dissipate.
So those are the kinds of things, when you get into large computing environments, you start thinking about.
Were there other drivers?
Now the other interesting piece about cloud computing — if you are running 20,000 to 30,000 servers or 100,000 cores — what you say to yourself is that you write your applications in such a way that any one of them going down is less than 1 percent of your processing needs. So your service model changes.
Instead of saying, I need to have [around-the-clock availability] for every server, you might say it is OK for the vendor to replace parts every two to three days. All of a sudden, the full cost of ownership across the board gets reduced. Because if they don’t need to show up in the middle of the night, there is a business model in which cloud computing reduces the entire cost structure for us.
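The sizing argument in this answer comes down to simple arithmetic, sketched here in Python. The function names and the percentage budget parameter are illustrative assumptions, not AMD's tooling, and the sketch treats every server as contributing equal capacity.

```python
def capacity_lost(total_servers: int, failed_servers: int) -> float:
    """Fraction of capacity lost, assuming all servers are identical."""
    if total_servers <= 0:
        raise ValueError("total_servers must be positive")
    return failed_servers / total_servers


def max_failures_within_budget(total_servers: int, budget_percent: int = 1) -> int:
    """Largest number of servers that can be down at once while the
    capacity lost stays at or below budget_percent (integer percent)."""
    return total_servers * budget_percent // 100


if __name__ == "__main__":
    # Fleet size taken from the low end of the range quoted in the interview.
    fleet = 20_000
    print(f"One failure costs {capacity_lost(fleet, 1):.4%} of capacity")
    print(f"Servers that can wait for repair: {max_failures_within_budget(fleet)}")
```

Under these assumptions, a fleet of 20,000 servers can have up to 200 machines waiting for parts before the loss reaches 1 percent, which is why a repair visit every few days can be acceptable.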
AMD has a cloud developed internally. Is it a private cloud, and what delivery models are you offering — infrastructure as a service, software as a service or platform as a service?
For us, it is infrastructure as a service and software as a service internally, where we’re coming in and saying, “You have these jobs that you need to run. You just throw them toward us in this general environment.” We will take it and absorb it and get it running in the most ideal position.
So we have created an environment where the users are not aware of where their [jobs are] going to run, but all of the bits and pieces that they need are there.
The challenge for us is that, because we drive our utilization pretty hard and pretty high, going to a public cloud is not that cost effective.
What is the underlying infrastructure for your cloud?
Pretty much what we use is homegrown, and it is running on Linux. When you’re an early [implementer], you have to roll out your own [technology] until you’re ready to see how the technology standards have evolved and, as they do, how to jump in.
How many years have you had a cloud?
We have had it for many years. We didn’t call it cloud, but when we looked at the characteristics of it, [we’ve had what could be called cloud computing for] probably 10 years.
We have had our own set of experiences of running a private cloud — [the things] I need to worry about. I need to worry about licensing, latency, data replication and data movement. So now when we talk to some of the cloud vendors, we are asking questions we had to solve ourselves related to our own private cloud. So it has given us a certain level of expertise on how to run a large [server] farm on our own.
How did you handle those issues, such as licensing?
We had to create our own layer of intelligence. Metadata becomes important. It is like the old days of IBM Job Control Language. All of a sudden, here, when you have applications running, you need metadata associated with the applications.
Who are you? How much memory do you need to use? What licenses do you need to use? What resources do you need? Then, how do I do a match and fit [these functions] in a real-time environment? I have to know where you are and where is your data located to make sure I don’t have to migrate data in the process of getting you [resources].
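The questions above (who the job belongs to, how much memory and which licenses it needs, and where its data lives) can be read as a matching problem. The following Python sketch shows one way such metadata might drive placement. Every name here (`Job`, `Host`, `place`, the site and license labels) is hypothetical, and the logic is an assumption for illustration, not a description of AMD's homegrown layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    owner: str            # "Who are you?"
    memory_gb: int        # "How much memory do you need?"
    licenses: frozenset   # "What licenses do you need?"
    data_site: str        # where the job's data is located


@dataclass(frozen=True)
class Host:
    name: str
    site: str
    free_memory_gb: int


def place(job, hosts, free_licenses):
    """Return a host for the job, or None if it cannot run right now.

    free_licenses maps a license name to the number of free seats.
    Hosts at the job's data site are preferred so data need not migrate.
    """
    # Every license the job needs must have at least one free seat.
    if any(free_licenses.get(name, 0) < 1 for name in job.licenses):
        return None
    # Keep only hosts with enough free memory.
    candidates = [h for h in hosts if h.free_memory_gb >= job.memory_gb]
    if not candidates:
        return None
    # Sort: same-site hosts first, then the tightest memory fit.
    candidates.sort(key=lambda h: (h.site != job.data_site, h.free_memory_gb))
    return candidates[0]


if __name__ == "__main__":
    hosts = [Host("h1", "site-a", 64), Host("h2", "site-b", 32)]
    job = Job("engineer-1", 16, frozenset({"simulator"}), "site-b")
    print(place(job, hosts, {"simulator": 1}))
```

In this sketch the job lands on the host in `site-b` because its data is already there, even though the `site-a` host has more free memory.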
And how did you handle the issues of latency and data movement?
The latency depends on where you’re sitting and the jobs we do for you to get the response times you need. Let’s say you’re working on something that needs a visual representation. You’re looking at a real-time interaction. So you’re saying, because of that, I need to make sure that these jobs are running close to you. I need to [find out whether] you are geographically close to the job and whether the data is close to you. The easiest [environment] to run is one big farm where you don’t fragment.
The data movement? You have a job that looks at gigabytes of data, and the processing is done on one side while the data is in a different data center. You need to make the calculation [of] what is easier to move. Is it the processing job, or do you need to replicate the data because you have more jobs than one data center can [handle]? Those are the complexities you have to bring to the table.
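The calculation described here can be sketched as a comparison of two waiting times. This Python example assumes a deliberately simple model: moving the data costs its transfer time over the link between data centers, and moving the job to the data costs the queue wait at the data's site. The function names, parameters and return labels are hypothetical; a real scheduler would weigh many more factors.

```python
def transfer_seconds(data_gb: float, link_gbps: float) -> float:
    """Seconds to copy data_gb gigabytes over a link of link_gbps gigabits per second."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    return data_gb * 8 / link_gbps  # 8 bits per byte


def cheaper_move(data_gb: float, link_gbps: float, queue_wait_at_data_site_s: float) -> str:
    """Pick the option with the shorter delay before the job can start.

    "move_job"  : run the job in the data center that already holds the data.
    "move_data" : replicate the data to the data center with free capacity.
    """
    copy_time = transfer_seconds(data_gb, link_gbps)
    return "move_job" if queue_wait_at_data_site_s <= copy_time else "move_data"


if __name__ == "__main__":
    # Illustrative numbers only: 500 GB of data over a 10 Gbps link.
    print(transfer_seconds(500, 10))
    print(cheaper_move(500, 10, queue_wait_at_data_site_s=120))
    print(cheaper_move(500, 10, queue_wait_at_data_site_s=3600))
```

With these illustrative numbers, a two-minute queue at the data's site favors sending the job to the data, while an hour-long queue favors replicating the data to where compute is free.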
How did you deal with issues of reliability and security?
For us, because it is sort of a private cloud under our domain, layered security is something we historically have done. What we have not done yet is figure out how to mix our private cloud with a public cloud. And one of the things missing is standards.
It is like the old mainframe days. If you wrote something that ran in MVS [mainframe operating system] and if you had something running in Unix, you couldn’t move things around. Today it is similar. If you have designed something to run under Microsoft Azure or Google or Amazon, there is no real interoperability: you cannot take an application that runs on one and move it to another.
Some people are moving e-mail [to the cloud]. E-mail has its own level of complexity. For instance, I have meeting rooms. How do I put all my meeting rooms in the mail system to do calendaring? So integrating your own internal infrastructure — not that it can’t be done — is something that has to be thought about.
So you haven’t done e-mail in the cloud?
Not yet. We’re thinking about it. But for me, my compute farm is really where the maximum benefit is. Ninety-five percent of my capital spending is in my compute environment. So we focus a lot on optimizing it, making that more efficient versus saying, "I reduced my e-mail servers from 20 to 15." It’s interesting, but it is much more important how I deal with the thousands of servers and making those more efficient.
Getting back to mixing and matching clouds, how do you deal with that?
Right now, I’ve avoided the problem by dealing with it internally. One of the beauties that has happened in the last 10 to 20 years is interoperability — the ability to say, "I’m writing my application, and I can run it on multiple hardware [platforms]." I don’t want to get myself back to the old days. If I need to move [applications], it is a porting effort. The moment you put out the “porting” word, that means real work and money have to be spent.