On 2016-04-28 22:26, Jon Masters wrote:
On 04/28/2016 05:00 PM, Gordan Bobic wrote:
> On 2016-04-28 19:49, Jon Masters wrote:
> First of all, Jon, thank you for your thoughts on this matter.
No problem :)
>> Allow me to add a few thoughts. I have been working with the ARM
>> (as well as the ARM Architecture Group) since before the architecture
>> was announced, and the issue of page size and 32-bit backward
>> compatibility came up in the earliest days. I am speaking from a Red Hat
>> perspective and NOT dictating what Fedora should or must do, but I do
>> strongly encourage Fedora not to make a change to something like the
>> page size simply to support a (relatively) small number of corner cases.
> IMO, the issue of backward compatibility is completely secondary to
> the issue of efficiency of memory fragmentation/occupancy when it comes
> to 64KB pages. And that isn't a corner case, it is the overwhelmingly
> primary case.
Let's keep to the memory discussion then, I agree. On the fragmentation
argument, I do agree this is an area where server/non-server uses
certainly clash. It might well be that we later decide in Fedora that 4K
is the right size once there are more 64-bit client devices.
As an additional factoid to throw into this, one obvious case where
large pages can be beneficial is databases. But speaking as a
database guy who measured the positive impact of using huge pages
on MySQL, I can confirm that the performance improvement arising
from putting the buffer pool into 1MB huge pages instead of 4KB
pages is in the 3% range. And that is when using 1MB pages instead
of 4KB pages. While I haven't measured it, it doesn't seem
unreasonable to extrapolate the following:
1) 4KB -> 64KB pages will make less difference than 4KB -> 1MB
pages in this case, the use case that is supposed to be the
prime example where larger memory pages make a measurable difference.
2) Regardless of whether we use 4KB or 64KB standard pages,
we can still use huge pages anyway, further minimizing the
usefulness of the 64KB compromise.
>> Having several entirely separate ISAs just for the fairly nonexistent
>> field of proprietary non-recompilable third party 32-bit apps doesn't
>> really make sense. Sure, running 32-bit via multilib is fun and all,
>> but it's not
>> really something that is critical to using ARM systems.
> Except where there's no choice, such as closed source applications
> (Plex comes to mind) or libraries without appropriate ARM64 support,
> such as Mono. I'm sure pure aarch64 will be supported by it all at
> some point, but the problem is real today.
It's definitely true that there are some applications that aren't yet
ported to ARMv8, though that list is fairly small (compared with IA32).
> But OK, for the sake of this discussion let's completely ignore the
> 32-bit support to simplify things.
>> The mandatory page sizes in the v8 architecture are 4K and 64K, with
>> various options around the number of bits used for address spaces,
>> pages (or ginormous pages), and contiguous hinting for smaller "huge"
>> pages. There is an option for 16K pages, but it is not mandatory. In
>> server specifications, we don't compel Operating Systems to use 64K, but
>> everything is written with that explicitly in mind. By using 64K
>> we ensure that it is possible to do so in a very clean way, and then if
>> (over the coming years) the deployment of sufficient real systems shows
>> that this was a premature decision, we still have 4K.
> The real question is how much code will bit-rot due to not being
> tested with 4KB pages.
With respect, I think it's the other way around. We have another whole
architecture targeting 4K pages by default, and (regretfully perhaps,
though that's a personal opinion) it's a pretty popular choice that
people are using in Fedora today. So I don't see any situation in which
4K bitrots over 64K. I did see the opposite being very likely if we
didn't start out with 64K as the baseline going in on day one.
Perhaps. Hopefully this won't be an issue at least as long as Fedora
ships both 32-bit and 64-bit ARM distros.
>> I also asked a few of the chip
>> vendors not to implement 32-bit execution (and some of them have
>> omitted it after we discussed the needs early on), and am
>> pushing for it to go away over time in all server parts. But there's
>> more to it than that. In the (very) many early conversations with
>> various performance folks, the feedback was that larger page sizes than
>> 4K should generally be adopted for a new arch. Ideally that would have
>> been 16K (which other architectures than x86 went with also), but that
>> was optional. Optional necessarily means "does not exist". My
>> when Red Hat began internal work on ARMv8 was to listen to the
> Linus is not an expert?
Note that I never said he isn't an expert. He's one of the smartest people
around, but he's not right 100% of the time. Folks who run
performance numbers were consulted about the merits of 64K (as were a
number of chip architects) and they said that was the way to go. We can
always later decide (once there's a server market running fully) that
this was premature and change to 4K, but it's very hard to go the other
way around later if we settle for 4K on day one. The reason is 4K works
great out of the box as it's got 30 years of history on that other arch,
but for 64K we've only POWER to call on, and its userbase generally
aren't stressing the same workloads as on 64-bit ARM. Sometimes they
are, and that's been helpful with obscure things like emacs crashing due
to a page size assumption or two on arrow presses.
Indeed, but the POWER hardware also tends to be used in rather niche
cases, and probably more often with large databases than x86 or ARM.
And as I mentioned above, even on workloads like that, the page size
doesn't yield groundbreaking performance improvements. Certainly not
nearly enough improvement to offset the penalty of, say, the hypervisor
>> I am well aware of Linus's views on the topic and I have
>> on G+ and elsewhere. I am completely willing to be wrong (there is not
>> enough data yet) over moving to 64K too soon and ultimately if it was
>> premature see things like RHELSA on the Red Hat side switch back to 4K.
> My main concern is around how much code elsewhere will rot and need
> attention should this ever happen.
I think, once again, that any concern over 4K being a well supported
page size is perhaps made moot by the billions of x86 systems out there
using that size. Most of the time, it's not the case that applications
have assembly code level changes required for 64K. Sure, the toolchain
will emit optimized code and it will use adrp and other stuff in v8 to
reference pages and offsets, but that compiler code works well. It's not
the piece that's got any potential for issue. It's the higher level C
code that possibly has assumptions to iron out on a 64K base vs 4K.
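
As a minimal sketch of the kind of assumption meant here (shown in Python rather than C for brevity; the helper names are hypothetical and not taken from any project discussed in this thread): code that rounds a length up to a hardcoded 4096 gives a result that is not page-aligned on a 64K kernel, while asking the kernel for its page size works on both.

```python
import mmap
import os

# The kind of constant that works on 4K kernels and quietly breaks on 64K.
HARDCODED_PAGE = 4096


def runtime_page_size() -> int:
    """Ask the running kernel for its page size instead of assuming 4K."""
    try:
        return os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is unavailable on some platforms; mmap exposes the same value.
        return mmap.PAGESIZE


def round_up(length: int, page_size: int) -> int:
    """Round length up to a whole number of pages (page_size must be a power of two)."""
    return (length + page_size - 1) & ~(page_size - 1)


if __name__ == "__main__":
    length = 5000
    print("assumed 4K :", round_up(length, HARDCODED_PAGE))
    print("runtime    :", round_up(length, runtime_page_size()))
```

On a 4K kernel both lines print the same value; on a 64K kernel only the second one is page-aligned.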
Indeed, the toolchain output is a concern - specifically anything that
would cause aarch64 binaries to run with 64KB kernels but not 4KB ones.
But I concede that at this stage such bugs are purely theoretical. I
have certainly not (yet?) found anything in an aarch64 distro that
breaks when I replace the kernel with one that uses 4KB pages.
>> Fedora is its own master, but I strongly encourage retaining
>> 64K granules at this time, and letting it play out without responding to
>> one or two corner use cases and changing course. There are very many
>> design optimizations that can be done when you have a 64K page size,
>> from the way one can optimize cache lookups and hardware page table
>> walker caches to the reduction of TLB pressure (though I accept that
>> huge pages are an answer for this under a 4K granule regime as well). It
>> would be nice to blaze a trail rather than take the safe default.
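
To put a rough number on the TLB pressure point, here is a small sketch of "TLB reach" (entries times page size). The entry count of 512 is a made-up illustrative figure, not the specification of any real ARMv8 core, and the function name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB

# Assumption for illustration only; real TLB sizes vary by core and level.
HYPOTHETICAL_TLB_ENTRIES = 512


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_size


if __name__ == "__main__":
    for page_size in (4 * KIB, 64 * KIB):
        reach = tlb_reach(HYPOTHETICAL_TLB_ENTRIES, page_size)
        print(f"{page_size // KIB}K pages: {reach // MIB} MiB reach")
```

The same arithmetic applies to huge pages under a 4K granule, which is why they are an alternative answer to the same problem.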
> While I agree with the sentiment, I think something like this is
> better decided on carefully considered merit assessed through
> empirical measurement.
Sure. We had to start with something. Folks now have something that they
can use to run numbers on. BUT note that the kind of 64-bit hw that is
needed to really answer these questions is only just coming. Again, if
64K was a wrong choice, we can change it. It's only a mistake if we
always dogmatically stick to principle in the face of evidence to the
contrary. If the evidence says "dude, 64K was at best premature and
Linus was right", then that's totally cool with me. We'll meanwhile have
a codebase that is even more portable (different arch/pagesz).
Fair enough. I guess the next step would be to actually run some
>> My own opinion is that (in the longer term, beginning with
>> should not have a 32-bit legacy of the kind that x86 has to deal with
>> forever. We can use virtualization (and later, if it really comes to it,
>> containers running 32-bit applications with 4K pages exposed to them;
>> an implementation would be a bit like "Clear" containers today) to run
>> 32-bit applications on 64-bit without having to do nasty hacks (such as
>> multilib) and reduce any potential for confusion on the part of users
>> (see also RasPi 3 as an example). It is still early enough in the
>> evolution of general purpose aarch64 to try this, and have the
>> fallback of retreating to 4K if needed. The same approach of running
>> under virtualization or within a container model equally applies to
>> ILP32, which is another 32-bit ABI that some folks like, in that a third
>> party group is welcome to do all of the lifting required.
> This again mashes 32-bit support with page size. If there is no
> 32-bit support in the CPU, I am reasonably confident that QEMU
> emulation of it will be unusably slow for just about any serious
> use case (you might as well run QEMU emulation of ARM32 on x86
> in that case and not even touch upon aarch64).
Point noted. If we keep the conversation purely to the relative merits
of 64K vs 4K page size upon memory use overhead, fragmentation, and the
like, then the previous comment about getting numbers stands. This is
absolutely something we intend to gather within the perf team inside Red
Hat (and share in some form) as more hardware arrives that can be
realistically used to quantify the value. You're welcome to also run
numbers and show that there's a definite case for 4K over 64K.
Indeed I intend to, but in most cases getting real world data to run
such numbers is non-trivial. Any real data large enough to produce
meaningful results tends to belong to clients, who by and large
run on x86 only. So right now the best I can offer is experience
that on database workloads huge pages outperform 4KB pages by very
low single figure % points.
It is therefore questionable how much difference using 64KB non-huge
pages might actually make in terms of performance, while increases
in memory fragmentation are reasonably well understood.
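
A back-of-envelope sketch of that occupancy cost, assuming each allocation is rounded up to whole pages on its own. The three mapping sizes are made-up illustrative values, not measurements, and the function names are hypothetical.

```python
KIB = 1024


def pages_needed(size: int, page_size: int) -> int:
    """Number of whole pages required to hold size bytes (ceiling division)."""
    return -(-size // page_size)


def wasted_bytes(sizes, page_size: int) -> int:
    """Total slack left in the last page of each mapping."""
    return sum(pages_needed(s, page_size) * page_size - s for s in sizes)


if __name__ == "__main__":
    # Hypothetical mapping sizes in bytes; substitute real data to get real numbers.
    mappings = [1, 5000, 70000]
    for page_size in (4 * KIB, 64 * KIB):
        waste = wasted_bytes(mappings, page_size)
        print(f"{page_size // KIB}K pages: {waste} bytes of slack")
```

The slack per mapping is bounded by one page, so many small mappings are where the 64KB case hurts most.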
It strikes me that this is something better tested in a lab
rather than guinea-pigging the entire user base, most of
whom aren't fortunate enough to have machines with tons of
RAM to not care.