On Jan 30, 2005, Jeff Johnson <n3npq(a)nc.rr.com> wrote:
> More seriously, I'm like a weekend's work away from adding a look-aside
> cache to rpm-4.4.x (which has a reliable http/https stack using neon)
> that could be invoked asynchronously to yum as
>
>     rpm -q http://host/path/to/N-V-R.A.rpm
>
> and then yum could read the header from the package as it was being
> downloaded
Err... By the time yum or any other depsolver decides to download a
package, it's already got all the headers for all packages. And I
hope you're not suggesting that yum get rpm to download *all* packages
just because it needs headers. *That* would be a waste of bandwidth.
> into /var/cache/yum/repo/packages since you already know the header
> byte range you are interested in from the xml metadata, thereby
> saving the bandwidth used by reading the header twice.
Hmm... I hope you're not saying yum actually fetches the header
portion out of the rpm files for purposes of dep resolution. Although
I realize the information in the .xml file makes it perfectly
possible, it also makes it (mostly?) redundant. Having to download
not only the big xml files but also all of the headers would suck in a
big way!
I was thinking to myself that having to download only the compressed
xml files might be a win (bandwidth-wise) over going through all of the
headers like good old yum 2.0 did, at least in the short term, and for
a repository that doesn't change too much.
But having to download the xml files *and* the rpms' headers up front
would make the repodata format a bit of a loser: not only would you
waste a lot of bandwidth on the xml files, which are much bigger than
the header.info files, but fetching only the header portion out of the
rpm files with byte-range downloads also makes them non-cacheable by,
say, squid.
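To make the point concrete, here is a minimal sketch in Python of the
kind of byte-range request being discussed. The URL is the example from
Jeff's message; the byte offsets are invented for illustration (in
practice they would come from the repodata's header-range metadata).

```python
import urllib.request

# Hypothetical offsets of the header section inside the .rpm file;
# real values would come from the repository's xml metadata.
url = "http://host/path/to/N-V-R.A.rpm"
hdr_start, hdr_end = 104, 24575

req = urllib.request.Request(url)
# Ask the server for only the header portion of the package.
req.add_header("Range", "bytes=%d-%d" % (hdr_start, hdr_end))

# A compliant server answers "206 Partial Content" with just those
# bytes -- and it is exactly such partial responses that proxies like
# squid generally refuse to cache.
print(req.get_header("Range"))  # bytes=104-24575
```

The request is only constructed, not sent; the point is that the Range
header turns an ordinary, cacheable whole-file GET into a partial fetch.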
I'd be very surprised if yum 2.1 actually worked this way. I expect
far better from Seth, and from what I read during the design period
of the metadata format, I understood that the point of the xml files
was precisely to avoid having to download the hdr files in the first
place. So why would they be needed? To get rpmlib to verify the
transaction, perhaps?
> That's a far bigger bandwidth saving than attempting to fragment
> primary.xml, which already has timestamp checks to avoid downloading
> the same file repeatedly
The problem is not downloading the same file repeatedly. The problem
is that, after it is updated, you have to download the entire file
again to get a very small amount of new information. For a biggish
repository like FC updates, development, pre-extras, extras, or dag,
freshrpms, at-rpms, newrpms, etc., that's a lot of wasted bandwidth.
--
Alexandre Oliva
http://www.ic.unicamp.br/~oliva/
Red Hat Compiler Engineer aoliva(a){redhat.com, gcc.gnu.org}
Free Software Evangelist oliva(a){lsd.ic.unicamp.br, gnu.org}