What’s new for Storage Spaces Direct | Windows Server Summit 2019

>>Hi, and thanks for tuning in to this session all about what’s new in Storage Spaces Direct. I’m Cosmos, and I’m joined in studio today by Adi from the Storage Spaces Direct product team. Adi, thanks for being here.

>>Thanks for having me.

>>So tell us a little bit about what you work on.

>>Yeah. So I’m a PM on Windows Server, and my job is to make sure that Windows Server is safe and reliable. That involves figuring out how to monitor issues, improving the Health Service, and improving our diagnostics tools so that, in the end, troubleshooting is easier and faster for you.

>>So that’s a pretty important charter. I’m sure it’s been a huge focus for the team.

>>It has. People have asked for a lot of visibility into this, and so we’ve delivered a lot of features for it in Windows Server 2019, as you’ll see soon.

>>So one of the things with Storage Spaces Direct
that works a little differently from a conventional cluster, and this is true really with any software-defined storage regardless of the vendor, is that when you restart a server, there’s this different behavior where it needs to re-sync, right? Can you tell us a little bit about what that is?

>>Yeah. Really good question. So in Windows Server 2019, we added a lot of visibility into this. To explain this functionality, let’s start with a very simple example of what re-sync actually is, so that we have a shared understanding. Suppose that we want to store the word “Hello”. Here, we see the bit encoding for the word in binary. So we currently have a single copy of the data. If we enable three-way mirror resiliency, we’ll see three copies of the data.

Now, suppose, for this example, that one server is temporarily down and can’t be accessed right now. Because of this, we can’t modify the data in copy one. But now, let’s say we want to update our data. We want to update the string “Hello” to “Help”. Now, our bit encoding obviously changes, and since server one is still down, we can’t update copy one, but we are able to update copy two and copy three. But now, you’ll notice that copy one is stale. It’s out of date. That data in copy one needs to be re-synced from copy two or three once server one comes back online. So what we do is store this state in the dirty region tracking, so that when server one comes back online, we can check it, read the current data, and overwrite the stale data.

Now, this provides a working explanation of how re-sync actually works. But as an administrator, what do you actually see? Well, let’s take that same three-node setup that we’ve been discussing. The first thing that you’ll see is server one going down for maintenance, and when server one comes back online, you’ll see it slowly re-syncing that data from the other copies. This will take some time, but when it’s done, you’ll see that all of your storage is back in sync.

>>Okay. So that’s
reasonably intuitive. As an administrator though, in Windows Server 2016, I believe you had to run a PowerShell command. You had to say, I think, Get-StorageJob. That was a kludge. Has that changed?

>>Yes, it has. So you’re right. If you look at the output of Get-StorageJob, it’s really difficult to know when re-sync will finish, how much data is actually being synced per server, and which servers are in sync. So the first thing that we added was more clarity into what’s actually happening under the hood. In Windows Server 2016, you would see that your server was up, then down, then back up again. But in Windows Server 2019, you’ll now see that your server is up, down, re-syncing data, and then back up again. The next thing that we did is that the Health Service shows you a warning, in Windows Admin Center and in PowerShell when you run Get-HealthFault, and this warning tells you when any data is being re-synced. The last thing that we did is that if you go to Windows Admin Center, under the server view, you can see, at the bottom center, in the storage graph, how much data needs to be re-synced per server. So you can monitor this on a per-server basis and understand when your storage re-sync is actually complete.

>>Now, that’s much more intuitive, right? So I can see it directly from the dashboard. I don’t need to drop into the command line, and I can think about it on a per-server basis as opposed to having to read the tea leaves with a bunch of jobs. That’s a huge improvement.

>>Thank you. Yeah. That’s exactly what we wanted to do. It was a hard problem, but we wanted to give you more visibility into the system.

>>So storage re-synchronization
is a software concept. But a hyperconverged infrastructure with Storage Spaces Direct is actually a combined hardware and software system, right? So on the one hand, you’ve got four or eight or however many servers running Windows Server. But you’ve also got potentially tens or hundreds of drives whose performance is critical to making sure that the system works well, and hardware does eventually fail, of course. So how can I look out for that?

>>So there’s a two-part answer for that. The first is that, you are right, drives are expected to fail all the time, and that’s why we built some new visibility in Storage Spaces Direct to give you more insight into this. There are two cases. The good case is where a drive is working well and then all of a sudden the drive starts failing. That’s good because you have a clear point in time when you know something went wrong. The troublesome case is when the drive is perfectly healthy, then at some point you notice a small performance issue, then the drive disappears but reappears, and you start noticing some weird app issues, and you don’t know what’s going on, and a little later, things seem to be back to normal. Now, that’s definitely not easy to troubleshoot.

It gets even more complicated. If you start trying to troubleshoot this, you’ll first realize that there are a lot of different media types. There’s HDDs, SSDs, PMEM. On top of that, there are different bus layers and protocols to access them. There’s SATA, SAS, NVMe. Each of these has provided more insight into helping you understand what’s going wrong, but none of these solutions are standardized. So if you look at S.M.A.R.T. counters, they were made for SATA only, and they’re made to log some of the IO issues that you might have. But like we said, it’s only for SATA. In order to look at this data, you actually need to go to the manufacturer’s tooling and look into it. So if you have drives from different vendors, you have to download specific tooling for each one.

Okay. So what if you look at some of the logs? There’s the Device Statistics Log, which is for SAS drives only, and that’s also not standardized. For NVMe drives, there’s this thing called critical warnings. But again, that’s only available on NVMe drives. Actually, one thing I want to note: for NVMe drives, if you actually read their white paper, you’ll see that they labeled their logs as the SMART/Health Information Log. That’s not to be confused with S.M.A.R.T. counters for SATA. Those are two different things. So all of this gets really confusing. So there’s a consolidation effort
made at Microsoft to make the Get-StorageReliabilityCounter PowerShell cmdlet. But this also had issues. As you’ll notice in the screenshot, support varies by manufacturer, and that’s because there are different layers in the stack that may not be able to translate all the vital information. The classic example of this is if you have a SAS HBA that’s attached to your motherboard and you have SATA drives that are linked to it. It’s hard to translate that over, and so we can’t always get the right information in the cmdlet. On top of that, each of these drives is running hundreds of thousands of lines of code. The firmware interfaces with even more code in the driver.

Okay. So there’s obviously a lot of complexity here. So what if we just look at the system event logs, look at the disk log, and look at event IDs 7, 153, and 154? You can do that. But that gets even more complicated, because if you’re running a hyperconverged system with Storage Spaces Direct, you’ll also have to look at its operational logs, which are events 203 and 204. Now, you can also look at perf counters, which usually give you a good indicator of your physical disk performance. But on top of that, if you’re running clustering, you also have to look at the Cluster Storage Disk and Cluster Storage Cache counters, because they have their own perf counters. This list just keeps going on and on, and what makes this problem really hard is the sheer diversity, the lack of standardization, and simply the lack of a single way to investigate the issue.

>>Yeah. I mean, you said it right. That is really complicated. That can get pretty involved. I don’t know many folks whose title is “I’m a
drives administrator.” It’s like, no, if I’m in IT, I’ve got other things I need to be doing. So how does Windows Server 2019 make this easier?

>>Yeah. So good question. That’s why in Windows Server 2019, we introduced a new approach. Rather than what these other solutions do, which is focusing on making sure that the drive isn’t doing anything bad, we do something different. We make it easy to check that your drive is doing what we need it to do. Since Storage Spaces Direct is resilient, we can be reactive, which means we don’t need to predict drive flakiness or failures in advance. We just need to recognize it so that we can act decisively.

So what do we actually need from a drive? Well, we need two things. The first is that it should be able to complete the IO requests successfully. When Windows needs to read or write, the drive should be able to read that data or write it successfully. In some cases, there are exceptions to that where failures happen, and we need to be able to indicate those and bubble them up. The second thing is that it should complete these IOs promptly. When doing read IO or write IO, the operation should occur in a consistent, predictable way in a short amount of time. For HDDs, on average, that’s 10 to 20 milliseconds, and for SSDs, it’s around one to two milliseconds. That’s the average normal expected time. If it takes much longer than that, then we know that this drive might need to be replaced.

So the best part about the new drive latency and error statistics is that they work with every single drive that you have. You don’t need special vendor or hardware requirements for it. It works with all protocols (SATA, SAS, NVMe) and all media types (SSDs, HDDs, PMEM). The great thing is, it’s always on; no configuration is needed. Now, the way that you
access it is through a new visualization in Windows Admin Center, which I’ll show you soon, and through a new cmdlet in PowerShell that operates closely with Get-PhysicalDisk. The reason why this is so powerful is that we record this data in the right place. We go to the lowest edge of the Windows software stack, where we’re able to monitor IO successes and failures, and using this, we build a latency distribution; we don’t just take the average. This histogram of sorts has 12 buckets, ranging from one-fourth of a millisecond all the way up to 20 seconds. We categorize your IOs into these buckets, and we take the hourly maximum and the average. That gives you basically the right data collected in the right view so that you can figure out what’s going wrong.

>>So you said we can see this in Windows Admin Center, right?

>>Yes.

>>Can we see what that looks like?

>>Yeah. I would love to show you. So this is a demo of everything we’ve been discussing
in Windows Admin Center. So if we open up Windows Admin Center and we go to the "Drives" page, you’ll see all of your drives listed along with their health state. If we click on one of them, you’ll see that this is an SSD. Let’s full-screen this page and scroll down. At the bottom here, you’ll see a new feature for "Drive latency and error statistics". What this is basically telling you is all the data that we’re going to be collecting to help you decide whether the drive is performing well or not. So let’s click "Show Latency and error statistics". So this successfully completed, and now at the bottom, you’ll notice we have some data there.

If we zoom in, we’ll see that we’re actually showing statistics from the last 100 intervals, and each interval is an hour long. So we’re showing the last 100 hours’ worth of data. This SSD, in the last 100 hours, performed a total of eight million IOs, and that’s reads and writes. Out of these, there were 142 gigs read and 41 gigs written. The average latency was under 200 microseconds, that’s 0.2 milliseconds. The incredible thing is, out of all of these eight million IOs that happened, none of them failed. So this SSD is actually performing really well.

If you look at the latency graph, you’ll notice that we can select the threshold to see if the IOs are actually slow. Right now, we’ve selected over two seconds as that threshold. We can change this to 64 milliseconds, we can change it to 16 milliseconds, and still we see no slow IOs. Now, that’s incredible. Think
about that for a second. This SSD performed eight million IOs, and none of them took more than 16 milliseconds to complete. That means it’s performing really, really well. So if we decrease this threshold to four milliseconds, now we’ll actually see some IOs passing this threshold. You see that 1000 of these eight million IOs actually took more than four milliseconds, and that represents 0.02 percent of all of these IOs. Here you’ll see a graph, a histogram of how many failed per hour. As we noted before, there are 100 intervals over the last 100 hours of data that we’re collecting here.

So the cool part about Windows Server 2019 is that we raise a Health Service fault to alert you when this drive is misbehaving and it’s slow. The way this fault works is we check if the drive is slow, we check that its peers aren’t slow, that they’re performing fine, and we check whether this has been happening for enough minutes in a row. We bubble this up not only in Windows Admin Center but also, as I said, it’s a fault, so you can run Get-HealthFault and see it there. Lastly, we integrated this fault very closely with Get-PhysicalDisk. So if you just run Get-PhysicalDisk normally, you’ll now see in the operational status if we’ve detected some sort of abnormal latency.

>>Wow. So that’s some pretty
powerful functionality. I think I’ve heard Microsoft also does this in Microsoft Azure, right?

>>Yes, that’s right. So this type of outlier detection functionality is much more advanced than what you typically see from other on-premises vendors. But it’s what we do in the public cloud, and it just felt right for us to bring that to our customers on-premises with Windows Server 2019, so that you are proactively getting some of these advanced features that are sometimes only available in the cloud.

>>Wow. Okay. So in Windows Server 2019, we’ve got new tools so that I can see my storage status and so that I can see if my hardware is performing correctly. That’s awesome. Now, suppose I do have some kind of an issue that I’m not able to troubleshoot. So maybe I’m going to need to contact Microsoft, or contact my vendor, or contact someone that I trust to help me with it. What do I do?

>>Good question. The last thing that you want to get
into is a back-and-forth of running different cmdlets or fetching individual logs, because that allows enough time to pass that, eventually, the evidence gets overwritten.

>>Right. Ideally, I just want to do one thing, click one button, and get all of the information that is going to be needed to troubleshoot.

>>Well, I have good news for you. We’ve done exactly that. We’ve made troubleshooting faster and easier with Get-SDDCDiagnosticInfo. This is our one-touch diagnostics gathering tool. It collects all of the event logs, counters, and other information into a single zip, and we’ve optimized it to run in parallel on your cluster. So on a 16-node cluster, it only takes five minutes to gather all the information you need. After doing that, it creates a file called the Health Summary Report, which gives you an overview of which components are unhealthy and what you should go and investigate. We’ve built in an option, as I’ll show you, to schedule this to run regularly, once a day, so that in the worst-case scenario, you’ll have a historical series of logs and data that you can go off of to see when your system started having errors and failing.

This is an ongoing project that we’ve developed as open source on GitHub. We take a snapshot as a minor release about every few weeks and update the PowerShell Gallery. Since it’s constantly improving based on customer feedback, we’re able to collect the right information in the worst-case scenario. So if there are any tools that you’ve been using, or anything that you feel we’re not collecting, feel free to check out our GitHub repo and submit a pull request.

Now, our roadmap for diagnostics
is a three-step process. The first step is to help you gather all of the right information, so that you can root-cause faster and resolve your issues. The second is, once we gather all the information, figuring out how to correlate across layers in the stack to pinpoint exactly what’s going wrong. That’s why we added these new health faults, and that’s why we give you this Health Summary Report with an overview of the unhealthy components. The last thing is, we’re now working on ways to provide insights using commonly known patterns and issues that you might see. Now, what I really want to do is demo what this experience looks like in Windows Admin Center.

>>Yeah. Let’s see it.

>>So when you go to
Windows Admin Center, you’ll see in this setup that we have a healthy system with two nodes. If we navigate to the "Diagnostics" pane, it’ll take a second to connect to your server, and you’ll see that the diagnostics tools aren’t currently installed on your server. You’ll see the latest version that’s available, so let’s install it. While this is happening, something to note: when you run into an issue like you mentioned before, your hardware vendor or customer support might ask you to run these tools, so it’s good to see how to use this flow.

So now that all of the tools have been gathered, you’ll see a rocker switch here to automatically schedule it to run every 24 hours. This is what we talked about earlier. If we enable this, the cmdlet and the backend will register a storage archive job. It allows you to create that history of logs and diagnostic information, so that when something goes wrong, you have a whole series of data to reference. Now that the data has been collected, you can go and see this current archive and other archives that you’ve created in the past, and you can navigate to the "Files" tool to view this zip.

Now, the "Files" tool works very similarly to File Explorer. So you can see the same three zips that we looked at before. If we click on the one that we just ran, we can download it so that we can unzip it, look through it, and read the Health Summary Report. At the same time, just to show you, if you go to the properties of this folder, you’ll notice that the size of the zip is relatively small. It’s all the relevant information that you need, and so it’s very easy to send this zip once to customer support, and they’ll be able to help you out with whatever issues you have, rather than that worst-case scenario we were talking about earlier, where you have to go back and forth and collect individual logs, and that just increases the time to resolution.

>>So with just one click, that gives me that zip file that has everything that’s going to be needed to troubleshoot the entire hardware and software stack for hyperconverged infrastructure, like storage, compute, networking, all in one package.

>>Exactly, and that was exactly our goal.

>>That’s wonderful.
So to recap then: in Windows Server 2019, there are new visualizations to understand the behavior of the storage system, especially storage re-synchronization, and so there’s a new chart. You can see it directly on the dashboard in Windows Admin Center, so you don’t need to drop into the command line and run Get-StorageJob anymore. There are the new drive error and latency statistics, as well as built-in outlier detection, which is pretty neat technology. Then there’s this one-touch log gathering, so you can easily get all of the information that you need with just a single click in Windows Admin Center, right?

>>Yeah, you got all of it. Our goal is to make Windows Server 2019 the most reliable operating system, one that gives you the tools to easily manage your hyperconverged Storage Spaces Direct clusters. To find solutions from your preferred hardware vendor, you can visit Microsoft.com/HCI. To install and check out Windows Server 2019, you can go to aka.ms/WindowsServer. To manage your Windows Server instance with Windows Admin Center, you can check out aka.ms/WindowsAdminCenter.

>>All right. Well, that’s some really exciting technology. Adi, thank you very much. For more information about Windows Server 2019, be sure to watch the other sessions of the Windows Server Summit, which will all be available on demand. Thank you very much.
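
The re-sync example from the session (three copies of “Hello”, one server offline while the data is updated to “Help”) can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class and method names are hypothetical, changes are tracked per block rather than per on-disk region, the second block is made-up filler data, and none of this reflects how Storage Spaces Direct is actually implemented. It illustrates the idea of dirty region tracking: writes that miss an offline copy are recorded, and only those blocks are copied back when the server returns.

```python
class MirroredVolume:
    """Toy three-way mirror with per-block dirty tracking (hypothetical names)."""

    def __init__(self, blocks, copies=3):
        # One independent list of blocks per copy (one copy per server).
        self.copies = [list(blocks) for _ in range(copies)]
        self.online = [True] * copies
        # dirty[i] = block indexes written while copy i was offline.
        self.dirty = [set() for _ in range(copies)]

    def write(self, block_index, payload):
        """Update every online copy; remember the block for offline copies."""
        for i, copy in enumerate(self.copies):
            if self.online[i]:
                copy[block_index] = payload
            else:
                self.dirty[i].add(block_index)

    def resync(self, index):
        """Copy only the dirty blocks from an up-to-date online copy."""
        source = next(
            (copy for i, copy in enumerate(self.copies)
             if i != index and self.online[i] and not self.dirty[i]),
            None,
        )
        if source is None:
            raise RuntimeError("no up-to-date online copy to re-sync from")
        repaired = sorted(self.dirty[index])
        for block_index in repaired:
            self.copies[index][block_index] = source[block_index]
        self.dirty[index].clear()
        return repaired


if __name__ == "__main__":
    volume = MirroredVolume([b"Hello", b"filler"])  # second block is illustrative
    volume.online[0] = False        # server one goes down for maintenance
    volume.write(0, b"Help")        # copies two and three are updated
    print(volume.copies[0][0])      # copy one is stale
    volume.online[0] = True         # server one comes back online
    print(volume.resync(0))         # only the dirty block is copied back
    print(volume.copies[0][0])
```

Because only the blocks recorded as dirty are copied, the amount of re-sync work in this sketch depends on how much changed while the server was down, not on the total size of the volume.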

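The drive latency statistics and the slow-drive fault described in the session can be sketched in the same hedged way. The session says only that the histogram has 12 buckets ranging from a quarter of a millisecond to 20 seconds, and that the fault checks whether the drive is slow, its peers are not, and this has lasted enough minutes in a row. Everything else below is an assumption made for illustration: the exact bucket boundaries, the 16 ms slowness threshold, the three-minute window, the use of the peer median, and all function names.

```python
from bisect import bisect_left
from statistics import median

# Assumed upper bounds in milliseconds; a 12th bucket catches anything slower.
BUCKET_UPPER_BOUNDS_MS = [0.25, 1, 4, 16, 64, 128, 256, 2000, 6000, 10000, 20000]


def build_histogram(latencies_ms):
    """Count IOs per latency bucket (bucket i holds latencies <= bound i)."""
    counts = [0] * (len(BUCKET_UPPER_BOUNDS_MS) + 1)
    for latency in latencies_ms:
        counts[bisect_left(BUCKET_UPPER_BOUNDS_MS, latency)] += 1
    return counts


def slow_fraction(counts, threshold_ms):
    """Fraction of IOs slower than threshold_ms, which must be a bucket bound."""
    first_slow = BUCKET_UPPER_BOUNDS_MS.index(threshold_ms) + 1
    total = sum(counts)
    return sum(counts[first_slow:]) / total if total else 0.0


def is_latency_outlier(drive_ms, peers_ms, slow_ms=16.0, minutes_required=3):
    """True if the drive was slow, while its peers were not, for the last
    minutes_required per-minute samples (all thresholds are assumptions)."""
    if not peers_ms or len(drive_ms) < minutes_required:
        return False
    recent = range(len(drive_ms) - minutes_required, len(drive_ms))
    return all(
        drive_ms[m] > slow_ms and median(peer[m] for peer in peers_ms) <= slow_ms
        for m in recent
    )


if __name__ == "__main__":
    counts = build_histogram([0.1, 0.2, 0.9, 3.0, 5.0, 30000.0])
    print(counts)
    print(slow_fraction(counts, 4))   # share of IOs slower than 4 ms

    peers = [[2.0, 2.1, 1.9, 2.0], [1.8, 2.0, 2.2, 2.1]]
    print(is_latency_outlier([2.0, 40.0, 45.0, 50.0], peers))
```

Comparing a drive against its peers, rather than against a fixed limit alone, is what keeps a uniformly busy cluster from flagging every drive at once; in this sketch a drive is reported only when it is slow and the median peer is not.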