GitLab’s Journey From Azure to GCP and How We Made it Happen (Cloud Next ’19)

[MUSIC PLAYING] TAD EINSTEIN: So
today is going to be a very interesting session. So GitLab, as you may
or may not be aware, migrated from Azure into GCP. And the goal of this session
was a complete, honest session of why they did it,
thoughts behind it, the pros, the cons, and really
help so that you can understand what a real-world company does
who has a lot of customers when they’re looking to do a
migration of this sort and a lot of the considerations
that go into play. So first some
quick housekeeping. We have something
called the Dory. So the Dory is
what’s going to allow you to ask questions and
answers throughout the event. So feel free to
log into your app. You’ll see right there
where it says Dory Q&A. And then you can ask questions. At the end, we’ll
answer the questions. As well as you can
come up to the mic so we can dive a bit deeper. So with that being
said, let’s get started. So first thing I’m just
quickly going to touch on is Google Cloud’s migration
path and our methodology. So whenever you’re doing
a migration, whether it’s on prem to on prem, on prem
to cloud, cloud to cloud, or whatever the future
is going to bring, you need to have a very
methodological approach– so the idea of assessing
what’s in your environment. Anyone who’s ever done
this and tried just moving something to the cloud
often breaks 20 other systems. So it's very important to understand technically what's there, the cost when you're going to move it, but also how all the things are interconnected, because you want to move things in the right order and the right move groups. So you assess. You understand what's there.
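To make the idea of move groups concrete, here is a minimal sketch that orders a few hypothetical systems so that dependencies migrate first; the system names are made up and this is not Google's assessment tooling.

```python
# Hypothetical illustration of ordering "move groups" by dependency.
from graphlib import TopologicalSorter  # Python 3.9+

# Each system maps to the set of systems it depends on.
dependencies = {
    "web-frontend": {"api", "auth"},
    "api": {"database", "auth"},
    "auth": {"database"},
    "database": set(),
}

# A topological order gives one safe migration sequence, so nothing
# is moved before the things it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # e.g. ['database', 'auth', 'api', 'web-frontend']
```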
The next step is setting up your landing zone. So what is the environment going to look like when you come to Google Cloud so that you can begin migrating? So you set up things
like your Identity and Access Management,
security, various policies, connectivity, maybe
back to a data center and additional cloud. Once that’s all set
up, then you can do the fun part, which is
actually migrating and moving to the cloud. When you’re in the
cloud, the next step is optimizing, because usually
there’s a reason people move to the cloud. And there are a lot of benefits
around operationalizing and taking it to the next level. So with that being said, I’m
going to bring Brandon up onstage, and he’ll talk a bit
about how GitLab got there. BRANDON JUNG: Thank you,
Tad, so much appreciated. TAD EINSTEIN: Thanks, buddy. BRANDON JUNG: So
today we’re going to go through a
couple of things. We are going to talk– we’ll talk a little bit
about our application. We’re going to talk a
bit about the foundations and how we went through
the process of moving over from Azure to GCP. We go through migration. And we’re going to touch
a little on optimize. That’s almost a
secondary subject, but we’ll hit this in
a couple of details. So a couple of things. My job here is to give you
a context of what GitLab is for a very simple reason. GitLab.com or
GitLab as a company, we’re a single DevOps tool. So we cover everything
from how you go down, how software developers
write software. This is what they
live in all day long. They go from the planning
ideas, writing their software, testing it for quality, ensuring
that it has great security, packaging up and deploying
it in to production, and even eventually
monitoring it. So this is what GitLab does. What we’re talking about
today is GitLab.com. So there’s two primary ways
that we think of a business. We have GitLab.com, which
is our SaaS offering. And that’s what we’re going
to spend most of today around. We also have GitLab, which the majority of our customers still love to
manage their own tool set. They run it in their
VPC on prem, on GCP, in Kubernetes, all of the above. So those are the two
different worlds. What we’re going
to do today is walk through what the migration was
for us on our SaaS platform. So as we go through it
on the SaaS platform, it’s important to have some
context on where we’re going and what the business drivers
were for us as a SaaS provider. That said, here’s some
context for us as a company. So we've been incorporated about five years now. We have 500 employees in 50-plus countries. We are a fully remote company. So there's not a single GitLab office anywhere. You're going to get some
context as to how that works. We have very, very
broad adoption. So our business
strategy is a strategy called user-based open core. So if you’re interested in
this, you can take a look, search it on Google. Search our CEO, Sid. He talked through all
the different models that an open source
company has to consider. And it is one of the more
challenging pieces to solve. And this is what we’re
trying to get to. You get some really good
context here from the growth. So, in terms of what we're
dealing with from a growth standpoint, we’re now pushing
about 10 million repositories running on GitLab.com. So big repositories. As we made the
move, some of what you need to
understand with GitLab is we also have a whole
bunch of other tool pieces in here that matter. So if we go back to this and
take a look at your pipeline, well, we're now running everyone's pipeline. And that's a very big use case. It's a lot of compute. That requires that
you do it really well. So all of these, as our customer
base grows, are critical for us to do well. We obviously move
pretty quickly. We have about 2,200
regular code contributors. We ship every month. And from that
standpoint, we also have another 10,000
total contributors over the time that
have worked with us. Now, again, this
is a meta process. So what you’re going to see
is we took GitLab as a tool. That’s where we live. That’s what we breathe. GitLab is used as a migration
tool to migrate GitLab.com. So to give you a
little context of what GitLab is, because
you’re going to have to understand what
we had to optimize to in order to deliver it. So to that end, the prerequisite slide, I'll just give a
little context to this. This is just all the different
organizations that care. Well, why does that matter? Because for us, more
and more of them are running on a SaaS offering. More and more of them are
looking and either running or considering a
run at GitLab.com. It has to be mission critical. We have to deliver this perfectly every time. But at the same point, we
use the exact same code base to run GitLab.com as we
do ship to our customers. And that's quite different. So that methodology
is both a plus, but it also means
that we have to meet– it allows us to dog food
the entire process through. Our focus for our customers
is you’ll see what we do– our customers like
Goldman Sachs. So Goldman, for us,
they moved over to GitLab, came on with about
1,000 employees focused on really optimizing and
changing the way they develop. They’re obviously a good
development company. Is anyone from Goldman here? If you ask anyone from Goldman,
they’re a software company. That is what they do. When they tackled
this initially, they said, OK, we’ll
go with 1,000 users. We’re going to roll that out. We’ll get into nine
months to a year. We’ll let you know in nine
months to a year how it goes. Inside of a month, they
needed another 5,000 licenses because they rolled
it out to everyone. So, easy on the rollout side. We need to make sure that works
really well for GitLab.com as well. So that process, critical
to our business success. Couple of things here. I’m going to lay some context. These are our core values. We believe deeply in the
notion of collaboration as an open source company. Couple things here, because I’m
more interested in showing you how this works. When we say that we
believe in collaboration, one of the important things
you want to know is, hey, who do I need to work with? Well, for reference, if you’re
interested, at GitLab.com, we publish our entire org chart. So you’re going to know who
you want to work with, how, where, et cetera. So in order to do that
well, obviously, we need to be able to be
very, very efficient. And we also, of course, will
get into this a little bit around the transparency. Now the efficiency
and cycle time, this is something
that matters in what we did, and it matters
to our customers. So both these are
really critical in being able to build software quickly. We’ve all heard
Marc Andreessen say software is eating the world. Got it. The more interesting thing
is his most current point is cycle time compression may
be the most underestimated force in determining who are the
winners and the losers in tech. This is the core of what a
digital transformation is. And for us, we’ve got to do
this because, as you saw, while we’re a 500-person
company, that’s wonderful. We’re doing great. We enjoy it. End of the day,
I’ll be very blunt. Microsoft’s our
biggest competitor. With that, our superpower at
GitLab is a very simple thing. We will iterate
faster than anyone. If we can do that
well, we succeed. If we don’t, we lose. Full stop. And our customers
expect the same. Couple of other fun
things here that are always worth looking at. From a transparency standpoint,
let me make sure we do this. Let’s do this. It’s always interesting
to say, hey, what’s the roadmap look like? So from this standpoint, if
you’re ever interested exactly on how we ship all of
GitLab– and again, GitLab, a full tool end-to-end suite– we publish our entire roadmap. We publish exactly
when we’re going to go, where we’re going
to go, what ships when. And it’s full feedback. So regularly, when we’re looking
at deciding and prioritizing, anyone– is anyone
a GitLab customer? OK. Thank you. Awesome. If you guys want to prioritize
where you want a next feature, it’s an issue. It’s open. It works with
everyone in the world. And we obviously value add. That’s something that allows
us to move much, much faster. One other area here that
I’ll probably look at, we also look at transparency
just as a culture. So I’m going to give
you this a little bit. This is a handbook. Any startups in the room? OK. Anyone ever tried to deal
with a remote culture? How do you manage
a remote culture? What are the right processes? This right here is a
handbook that the majority of YC companies that want
to go– this is a handbook. It's 2,000 pages on everything about how we run the company. So we literally write
every single thing down. Because we’re 100%
remote, decisions aren’t made in a room. They’re not made
over a water cooler. They’re made and
start in an issue. We live in an issue. We iterate an issue. And we move that
quickly forward. So in this, to give you
a sense of just sort of a little bit different than
maybe the world works, anyone been in the black
art of acquisitions? You probably haven't been. If you have been, well, that's part of what my job is. We decided to maybe take this a little bit differently. We decided, you know
what would be great? Why don’t we go ahead
and give everyone an exact idea of how we
go through acquisitions? And we’ll go through. I can’t drive as well
backwards as I would like to. But we go through
and tell exactly, hey, any company that we
would like to work with us, here’s exactly
how we would work. Here’s how your first
days would work. Here’s how we value you. Here’s how we’d bring you in. And so that transparency
is something that, again, works across here
and the rest of the company. So let me jump back in
and click to the next one, because the values are useful. This is what we
had to deal with. We start as a company more
or less in this create space. So when we had to run
our SaaS offering, we’re only needing
to deal with Git. It’s interesting. It’s really powerful. It’s our core. But now as we’ve expanded
and we deliver this level of complexity, we have to be
able to– on a regular basis– be able to run that for
every single customer. So that means we have
to run pipelines. It’s going to mean that we
have to have great management. We have to manage and run
all of this extremely well. So companies expect to be
able, if they live in a silo, and when they come to
GitLab, they’re asking, hey, how do I break
down those barriers? How can I run and
develop concurrently instead of sequentially? That was an issue for us. It was a prize for us. So as we went through
the process here, we used to, of course,
have a bunch of other tools too, because we only
started in the Git space. When we moved on to
planning, CI, monitoring, we were surprised
actually ourselves when we put these things
all together exactly what the compression
of that cycle time was, which is wonderful. But when we run that product now
as a SaaS for every customer, that thing better run
perfectly every time. Because they are betting– our customers are betting
their business on GitLab.com as the ability to ship
product every single day. So that’s the part
that we had to make sure we executed perfectly. One other thing. We talked a bit
about transparency, which obviously matters. But I’m going to– you’re going
to see a theme here throughout. That’s fine. I’m going to move to the
next slide here real quickly. Give you a little context of
what we had to migrate over. So when we moved over,
we had to move everything from the source code. But we had to move
all the Kanban boards. We had to move our
project management. Likewise, we wanted to roll
out– when we made this, we want to make it easy to do
things like, if you look right in the middle, review apps. Want to make it easy
for our customers to be able to deploy
new apps exactly how it’s running instantaneously. And so that was the
last part we did there. That said, I’m going
to quickly get it over to Andrew, because you guys
want to know why we moved over. We’re going to spend
most of our time– some people will know GitLab
as the left-hand side here. As an end user, that’s
mostly what the UI/UX is. Today, the rest of it, we're
going to spend on this part. How do we migrate all
the different pieces of the complex application
that is GitLab? Couple of really important
things that we focused on. First off, we wanted to
be suitable, obviously, for mission critical. That’s really important for
us to grow our business. We needed performance. We needed consistency in price. That was the reason we
moved from Microsoft. We saw better performance–
we’ll highlight that– better consistency, and
significantly better price. The other part, if you’re
thinking about the other part, was culture. So more and more development
is landing in cloud native. If you're here, you've
heard this all day. You heard Anthos. I don’t have to
explain any of that. However, for us, when we want
to work with a partner that’s going to help us
understand and work deeply around those technologies,
there's not a better place for us to go than Google. And so that was part
of why we moved. Last one was– and that hit
in part two and part three. So part three, a lot of
those are technologies that, while we use them very deeply, the advantage for us as a company working with Google early was we have real deep insight into how they work. Likewise, Google uniquely
was willing to walk us through exactly how
Google developers develop. So we get kind of a future
look into perhaps how other large
companies, or probably one of the fastest shipping
companies in the world, actually does their
DevOps life cycle. That said, the last one
I’m going to turn over to Andrew here in a second. And last one, I think, speaks
pretty much for itself. We’re a small company. In doing so, we were going
to write it all up. Andrew's going to tell
you how we did it. And I think that’s
pretty helpful for us. So with that, I’m going
to let Andrew chat. This is probably who
you wanted to talk to. And much appreciated. ANDREW NEWDIGATE: Thanks. Thanks, Brandon. My name is– [LAUGHS]. My name is Andrew Newdigate. I’m an engineer at GitLab. I helped lead the team
that moved GitLab.com from Azure to GCP. Now, Brandon told you
a little bit about why we wanted to move to GCP. And I’m going to tell you a
little bit about how we did it. And I need a clicker. Cool. So the goal of the project
was really, really simple. It’s basically move
GitLab.com to GCP. So we could state
it really simply. The problem was how could
we go about doing this? Now, Brandon discussed our
company values earlier. I’m going to try
to illustrate how we use these company values
to apply them to the way that we work. And in particular, I’m going
to focus on efficiency, or how we love boring
solutions, iteration, or how we focus on shipping
the minimum viable change, and finally on transparency. So the first
iteration of the plan was really a combination
of a really boring solution and a minimum viable plan. So we considered whether we
could just stop the whole site, copy all the data
from Azure to GCP, and then switch the DNS
over to point at GCP and then start
everything up again. So the problem is that we
had too much data to do this within a reasonable time frame. So we had, at the time,
about half a petabyte in total of data on Azure. And once we’d shut
down the site, we’d need to copy all of
this data between two cloud providers over the internet. And then once the
copy was complete, we’d need to verify
the data and make sure that it was all 100% correct. And only then could we
start the site up again. This plan meant that GitLab.com could be down for several days on end. And considering that thousands and thousands of people rely on GitLab on a daily basis, this didn't really work.
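To see why several days is plausible, here is a rough back-of-the-envelope calculation; the 10 Gbps of sustained throughput is an assumption for illustration, not a figure from the talk.

```python
# Back-of-the-envelope transfer time for ~0.5 PB over the network.
# The 10 Gbps sustained throughput is an assumption for illustration.
data_bytes = 0.5 * 1000**5           # 0.5 PB in bytes (decimal units)
link_bps = 10 * 1000**3              # 10 Gbps in bits per second
seconds = data_bytes * 8 / link_bps
print(f"{seconds / 86400:.1f} days")  # roughly 4.6 days, before any verification
```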
So we went back to the drawing board and started considering what our options would be. So around about
the same time, we were working on another
feature at GitLab, and that’s called Geo. And what Geo allowed us
to do is develop a feature for our enterprise
customers so that they could replicate a GitLab
instance from one site to another. And this is really
useful because you can use it for faster
access at the second site. It’s also redundancy. And you can use it for
site disaster recovery. So Geo works by doing an initial synchronization of all of your data from the first site to the second site. And then afterwards, it makes sure that everything stays synchronized, so that any change on the first site is replicated to the second in near real time. So you always have a duplicate copy of your data.
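As a mental model of that replication flow, here is a minimal sketch: a full initial copy, then a loop that applies recorded changes from the primary to the secondary. It is only an illustration, not GitLab's actual Geo code.

```python
# Conceptual model of primary -> secondary replication.
# This is only a mental model, not GitLab's real Geo implementation.
import queue

primary = {"repo/1": "v1", "repo/2": "v1"}   # hypothetical data on the first site
secondary = {}                                # the new site starts empty
changes = queue.Queue()                       # change events recorded on the primary

def initial_sync():
    # Full copy of anything the secondary does not have yet.
    for key, value in primary.items():
        secondary.setdefault(key, value)

def replicate_once():
    # Apply queued changes so the secondary stays current in near real time.
    while not changes.empty():
        key = changes.get()
        secondary[key] = primary[key]

initial_sync()
primary["repo/1"] = "v2"      # a write lands on the primary...
changes.put("repo/1")
replicate_once()              # ...and is replicated to the secondary
assert secondary == primary
```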
Now, GitLab.com is not a fork of GitLab. GitLab.com is GitLab. And so Brandon mentioned that
on the 22nd of every month, we do a release of GitLab. And the first place that
that release is deployed to is GitLab.com on the
7th of the month. So two weeks before we
release the software, we deploy it on our own
servers on GitLab.com. And we even use the
same package installer, called GitLab Omnibus, to
install onto GitLab.com. And this means that
any feature that we build for our self-managed
customers is also available for us to
use on GitLab.com. And so we hoped that by
utilizing these replication capabilities that we
were building for Geo, we could migrate the
entire GitLab.com site to a secondary instance in GCP. So a new plan was formed
using Geo for replication. And the secondary environment
would be GCP, as I mentioned. And then what we would do
is synchronize all the data. And this could
possibly take weeks. Or it could take months. It didn’t really
matter, because the site would be available throughout
the synchronization process. And then once all the data
had been synchronized to GCP, we could verify it and make sure
that it was all 100% correct. And then we could just
promote the GCP environment and make it our new primary. This had many advantages
over the initial plan. So obviously,
firstly, GitLab.com would be up for the
synchronization. And then we would have
a period of downtime, but it would be very short,
maybe an hour or two. And the other thing that
was great about it is we could time box it. So we could know
exactly how long that downtime was going to be. Then, since the new
environment would be running all of our production
data, we could verify it. We could do QA. We could build automated
QA tools, manual QA tools. We could do load testing on it. And obviously, we could
verify all of the data before the failover. And then finally, because
we had this migration, we could use the migration
as an opportunity to do ultimate QA of GitLab Geo. And so basically, if it could
work for us on GitLab.com, it would pretty much work
for any other customer who wanted to use Geo. And we could be
confident in that. I’m just going to quickly grab– BRANDON JUNG: And that
goes back to the iteration and the testing. And that goes back to our value
of iteration, testing, dog fooding. We make sure before
our customers touch it. We’re going to make
sure it runs great. And then it gets into
the largest deployment in the world, which means if
you’re running it yourself, you can be fully
confident that that’s something that’s ready to go. ANDREW NEWDIGATE: Cool. So around about this
time, we were actually working on another
major project, and that was called
Cloud Native GitLab. And this project is
basically about building a set of Helm
charts for GitLab so that we could deploy
GitLab inside a Kubernetes environment. And much like Omnibus is our
official package installer for installing GitLab outside
a Kubernetes environment, the Helm charts in
Cloud Native GitLab are our official
installer for installing GitLab inside a
Kubernetes environment. So instead of provisioning the new environment using Chef and Omnibus, we thought maybe what we could do is deploy it into Kubernetes using Google Kubernetes Engine and our Helm charts. And so we evolved the plan with another iteration so that we used Cloud Native GitLab for setting up the new environment.
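For context, getting GitLab onto a GKE cluster with those Helm charts looks roughly like the sketch below; the cluster name, zone, node sizing, and domain are placeholders rather than anything GitLab.com actually used.

```python
# Rough sketch of standing up GitLab from the Cloud Native GitLab Helm charts on GKE.
# Cluster name, zone, node sizing, and domain are placeholders.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create a small GKE cluster to host the deployment.
run(["gcloud", "container", "clusters", "create", "gitlab-demo",
     "--zone", "us-central1-a", "--num-nodes", "3",
     "--machine-type", "n1-standard-4"])

# Install the published GitLab chart into its own namespace.
run(["kubectl", "create", "namespace", "gitlab"])
run(["helm", "repo", "add", "gitlab", "https://charts.gitlab.io/"])
run(["helm", "repo", "update"])
run(["helm", "upgrade", "--install", "gitlab", "gitlab/gitlab",
     "--namespace", "gitlab",
     "--set", "global.hosts.domain=example.com"])
```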
The problem was that, as we went along, it became apparent that there were problems with this approach. So the changes that we needed
to make to GitLab.com– at least to GitLab the product– were extensive. And we realized that there was
going to be some major rework. And so it was difficult to align
the time frames of the Cloud Native GitLab project
and the GCP migration in order to get them both
delivered within the time frames that we wanted. So we went back, and we sort
of went to the next iteration, where we went back to using
Omnibus for provisioning the new environment. And we also realized
that it would be worthwhile to migrate
as much data as possible to object storage with
Google Cloud Storage. And so obviously, a lot of the
data on GitLab is Git data. But there are also huge
amounts of non-Git data– build artifacts, CI
repositories, CI images, that sort of thing. And what we could do is take it
off NFS, where it was stored, and move it into object storage. And doing this would allow us
to reduce the risk and the scale of the Geo migration. And additionally, moving that data into object storage and eliminating NFS from our stack meant that in future, when we moved to Kubernetes, it would be much easier because we wouldn't have to worry about NFS.
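As a simplified picture of that kind of move, the sketch below copies files from an NFS mount into a Google Cloud Storage bucket using the google-cloud-storage client; the bucket name and mount path are made up, and a real migration would also need verification, retries, and a cutover of the application's read path.

```python
# Simplified sketch: copy files off an NFS mount into a GCS bucket.
# Bucket name and mount path are placeholders.
from pathlib import Path
from google.cloud import storage  # pip install google-cloud-storage

NFS_ROOT = Path("/mnt/nfs/artifacts")
client = storage.Client()
bucket = client.bucket("example-gitlab-artifacts")

for path in NFS_ROOT.rglob("*"):
    if path.is_file():
        # Keep the object key relative to the mount so paths stay stable.
        blob = bucket.blob(str(path.relative_to(NFS_ROOT)))
        blob.upload_from_filename(str(path))
        print("uploaded", blob.name)
```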
So the steps for the migration were now fairly straightforward. We would provision
a new environment with GitLab Omnibus,
and we would set it up as a Geo secondary in GCP. And then we would replicate
all the data from GitLab.com in Azure to GCP. We’d wait for the initial
synchronization to complete. And then when it
had completed, we’d test the new environment with
automated manual testing tools. And we’d obviously verify that
all the data is 100% correct. And then finally,
once that’s done, we’d failover to
the GCP environment, promote it to primary,
and we’d be done. So we began to set up
the new environment. And we started the
replication process and just started
replicating non-Git content to object storage. Now, being remote means that
we can’t spend our time playing sword fights in the hallways
while we wait for things. So we used the time
to be productive, and we started
implementing other features in our new environment. So we set up Identity
Aware Proxies for all of our
internal web apps. We set up Cloud Key Management
Service for security. And we also did a major revamp of our logging infrastructure, moving to structured logging wherever we could and switching to Fluentd and Google Pub/Sub for logging. For short-term logging, we use Stackdriver and ELK. And for long-term logging, we store our data in Google Cloud Storage.
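A minimal sketch of that structured-logging idea is shown below, publishing a JSON log record to a Pub/Sub topic with the google-cloud-pubsub client; the project and topic names are placeholders.

```python
# Minimal sketch: emit a structured (JSON) log record to Google Cloud Pub/Sub.
# Project and topic names are placeholders.
import json
from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "gitlab-logs")

record = {
    "severity": "INFO",
    "component": "gitaly",
    "message": "request completed",
    "duration_ms": 42,
}

# publish() returns a future; result() blocks until the message is accepted.
future = publisher.publish(topic_path, json.dumps(record).encode("utf-8"))
print("published message", future.result())
```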
So there was only one major unknown left in this plan, and that was the actual
failover operation itself. So unfortunately,
Geo, at the time, didn’t support a
failover operation. So nobody knew exactly how
we would go about doing that. So we needed to define
a step-by-step plan to carry out this failover. Now, considering the
stakes were so high, it was essential that we got
this right the first time that we tried it. And how did we ensure this? Well, we just kept on iterating. So we set up a failover
procedure as an issue template in GitLab. And then each step in the
failover was a checklist item. And then we started
practicing the failover. And every time we
did a practice, we’d create a new issue
from the template. We’d run through it. And anything that
went wrong would get fed back into the
template as a merge request. So the whole team would have a say, and we could discuss it on the merge request. And through this very tight feedback loop, we could rapidly improve the failover procedure.
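One way to script the "new issue from the template for every practice run" step is sketched below with the python-gitlab client; the project path, token, and template filename are hypothetical.

```python
# Sketch: open a fresh practice-failover issue from a checklist template.
# Project path, token, and template filename are hypothetical.
import datetime
import gitlab  # pip install python-gitlab

gl = gitlab.Gitlab("https://gitlab.com", private_token="YOUR_TOKEN")
project = gl.projects.get("example-group/migration")

# The checklist lives in the repository as a Markdown issue template.
template_file = project.files.get(
    file_path=".gitlab/issue_templates/failover.md", ref="master"
)
template = template_file.decode().decode("utf-8")  # decode() returns bytes

issue = project.issues.create({
    "title": f"Failover practice run {datetime.date.today()}",
    "description": template,
})
print("opened", issue.web_url)
```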
So this animation over here shows the main failover document. And there were over 140 changes in that document. So every time we
tried something, we would submit a change. And over time, the document got
longer and longer and longer. And when we started,
every time we tried it, it would basically not work. But every time we did it, we
would improve it a little. And we kept doing this until
we were totally confident that the failover would work. And then when we were
confident, we actually introduced another
level of difficulty by starting to mess
with the failover, having somebody secretly
shutting down servers or rebooting Redis or shutting
down [INAUDIBLE] primary. And then the team
would have to recognize that that had happened
and take it into account– detect it and improve
it and then carry on with the failover. And once we were
doing that, we knew that we were totally
confident and that it was time to carry out the failover. So we let Google know, and
they assembled an amazing team to help us with the failover. Sorry. I’m just going to
have some water. BRANDON JUNG: One fun note
on that previous slide, if you’re ever interested
in exactly how that works, it’s an open issue. So anyone that’s
interested, you can watch exactly how we cut it over. You can go look on exactly how
we went through that process. And so you’re going to detail
by detail, ad nauseum, what that kind of work takes. So if it’s something you
all are trying to solve, again, we put all our
documents out there because we want to
collaborate and help everyone else along the
same journey we’ve been, and obviously get your feedback. Not necessary on the cutover. That’s long been since done. But in the rest of what we do. So it matters an
enormous amount to us. Hopefully that becomes something
also that’s a usable document. And I think pretty unusual
that a company would live stream how they make the migration, as well as giving everyone else
visibility on how you might be able to do it yourself. ANDREW NEWDIGATE: Cool. Oh, sorry. So I skipped a
little bit farther. So Google assembled an
amazing team of people to support us on the day. And we knew they’d be available
if we needed any assistance. Luckily, the failover
went really smoothly, and we didn’t experience
any major problems. On the Monday following the
failover, as traffic started rising on the site, we started
seeing some performance issues. And working with
Google, we realized that our Gitaly servers,
which run all the Git operations for GitLab, were
struggling a little bit. And we spoke to Google. They suggested we move from n1-highmem-8 machines to n1-standard-32 machines. And so we rotated the fleet out to those.
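In Compute Engine terms, rotating a VM onto a bigger machine type is a stop, set-machine-type, start cycle, roughly as sketched below; the instance name and zone are placeholders.

```python
# Sketch: move a VM to a larger machine type.
# The instance must be stopped before its machine type can be changed.
import subprocess

INSTANCE, ZONE = "gitaly-01", "us-east1-b"

def gcloud(*args):
    subprocess.run(["gcloud", "compute", "instances", *args,
                    INSTANCE, "--zone", ZONE], check=True)

gcloud("stop")
gcloud("set-machine-type", "--machine-type", "n1-standard-32")
gcloud("start")
```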
And after that, it was plain sailing the whole way. So going back to the
goals of the project, did we make GitLab.com suitable
for mission critical workloads? So firstly, let’s consider
availability on GitLab.com. So as an external
tool, we use Pingdom to monitor the site
externally and tell us if it sees any failures. And so I pulled the data
from Pingdom from before, when we were in Azure, and then
afterwards, when we’re in GCP. And this is a
graph of the number of errors we saw per day
in those two environments. So that is a few months before. And that is everything
after the failover. And the blue background– I’m colorblind. I can’t see it being
blue in this image, but hopefully it is. The blue background
is the period when we were running in GCP. So prior to our failover, we had
about 8.2 Pingdom errors a day. And after the failover, we
had just one error a day. So Pingdom also
reports availability. And this, you can see,
is our availability in the year leading up to
the migration is 99.61%, and afterwards was 99.93. Now, some people might
think that those two numbers are really
similar, but they’re not. 99.61 is 39 minutes
of downtime a week. And afterwards, we’ve seen
7 minutes of downtime. So we still have some
improvements to make. But it’s a lot
better than it was. So the other way that
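Those downtime figures follow directly from the availability numbers:

```python
# Convert availability percentages into minutes of downtime per week.
MINUTES_PER_WEEK = 7 * 24 * 60

for label, availability in [("Azure", 99.61), ("GCP", 99.93)]:
    downtime = (100 - availability) / 100 * MINUTES_PER_WEEK
    print(f"{label}: {downtime:.0f} minutes of downtime per week")
# Azure: 39 minutes, GCP: 7 minutes
```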
So the other way that we can compare the sites is with the performance of GitLab.com before and after moving from Azure to GCP. So one way that we could do this is with a latency histogram, which
this was I took data for one week before the
migration and one week after the migration so
that there weren’t too many other changes
to the application. We were pretty much
comparing like with like. And what you can see is that the
GCP line drops off much quicker and has a much, much
lower tail of values. And so what this shows us
is that our GitLab instance in Azure was slower, and
our GitLab instance in GCP was much faster. And we also had far fewer
unpredictable values, so random values that took 30
seconds or random requests that took 30 seconds. And then finally, we
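The same kind of comparison can be reproduced from raw request timings by looking at tail percentiles, roughly as below; the sample data here is made up for illustration, not the actual GitLab.com measurements.

```python
# Compare request latency tails before and after a migration.
# The sample data is synthetic and only illustrates the shape of the comparison.
import random

random.seed(1)
azure_ms = [random.lognormvariate(5.0, 1.0) for _ in range(10_000)]
gcp_ms = [random.lognormvariate(4.6, 0.7) for _ in range(10_000)]

def percentile(samples, p):
    ordered = sorted(samples)
    return ordered[int(p / 100 * (len(ordered) - 1))]

for p in (50, 95, 99):
    print(f"p{p}: azure={percentile(azure_ms, p):7.1f} ms  "
          f"gcp={percentile(gcp_ms, p):7.1f} ms")
```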
And then finally, as Brandon said, we have all the documents. We have the issue tracker. We have the project
documentation all available over here for
everyone to take a look at. [MUSIC PLAYING]
