One of my next goals to improve mirror crawling is to split the crawls of the mirrors by category. Right now we select a mirror and crawl all categories (Fedora Linux, Fedora EPEL, Fedora Secondary, Fedora Archives, Fedora Other) in one go. The drawback is that, for a mirror which carries everything, it is nearly impossible to finish the crawl within the time limit of 3 hours. There are a few mirrors which actually mirror everything, and they are usually dropped from the mirror list because the crawler always hits the 3 hour limit and marks the mirror as not being up to date. The current workaround is to create multiple hosts (which can point to the same mirror) with only one or two categories each. This works, but it is not the optimal solution.
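To make the idea of category-based crawls concrete, here is a minimal sketch that turns one "crawl everything" job into one job per category. It is not MirrorManager code: `CrawlJob`, `plan_crawls` and the host name are hypothetical, and applying the 3 hour limit to each category separately is an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta

# Category names as listed in the paragraph above.
CATEGORIES = [
    "Fedora Linux",
    "Fedora EPEL",
    "Fedora Secondary",
    "Fedora Archives",
    "Fedora Other",
]


@dataclass(frozen=True)
class CrawlJob:
    """One crawl of a single category on a single host (hypothetical type)."""
    host: str
    category: str
    time_limit: timedelta


def plan_crawls(host, carried_categories, time_limit=timedelta(hours=3)):
    """Return one job per category the host carries.

    Assumption: each category gets its own time limit, so a slow
    category can no longer cause the whole mirror to be dropped.
    """
    return [
        CrawlJob(host, category, time_limit)
        for category in CATEGORIES
        if category in carried_categories
    ]


# Hypothetical mirror that carries two of the five categories.
jobs = plan_crawls("mirror.example.org", {"Fedora Linux", "Fedora EPEL"})
for job in jobs:
    print(job.host, job.category, job.time_limit)
```

With such a split, a timeout would only affect the category that ran too long, instead of marking the whole mirror as out of date.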
The actual scanning of the remote mirror is, most of the time, not the real problem; updating the status of all those directories and files in the local database also takes a very long time.
The master crawling by update-master-directory-list (umdl) is already split up by category and fedmsg driven (for most categories). So whenever a repository is updated, umdl starts a scan and updates the database for only the category which has changed. This works pretty well, but it has the disadvantage that the database is now updated much faster, without giving the mirrors a chance to sync before the new information is in the database.
The reason for this long introduction is that my original plan was to immediately start a category crawl after umdl has signalled that a certain category has been updated in the database. This could lead to a very short list of mirrors which are up to date and therefore I would like to know if we should somehow introduce a delay between the time umdl has run and the time we start to crawl the mirrors. This would give the mirrors some time to sync the content before we crawl them.
Right now the time between the update of the master mirror and the crawl can be anywhere between 0 and 12 hours. With a defined waiting time before crawling the mirrors, this would be clearer than it is right now.
I am also hoping to be able to crawl the mirrors more often than twice a day after moving to category-based crawls.
So my main question is if we should insert a delay between umdl and the crawl of the mirrors? This would require a fedmsg emitted at the end of an umdl run and something on the crawler which waits some time before starting the crawls.
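The proposed delay could be sketched roughly as follows. This is only an illustration under assumptions: the 2 hour delay is an arbitrary value, `DelayedCrawlQueue` and its methods are hypothetical names, and the fedmsg handling itself is left out (the sketch starts at the point where a "umdl finished" notification for a category has already been received).

```python
import heapq
from datetime import datetime, timedelta

# Assumed value for illustration; the thread does not propose a number.
SYNC_DELAY = timedelta(hours=2)


class DelayedCrawlQueue:
    """Holds categories until the mirrors have had time to sync."""

    def __init__(self, delay=SYNC_DELAY):
        self.delay = delay
        self._heap = []  # entries are (earliest crawl start, category)

    def on_umdl_finished(self, category, finished_at):
        """Call this when umdl reports that a category was updated."""
        heapq.heappush(self._heap, (finished_at + self.delay, category))

    def due(self, now):
        """Pop and return every category whose delay has elapsed."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready


queue = DelayedCrawlQueue()
t0 = datetime(2015, 9, 11, 12, 0)
queue.on_umdl_finished("Fedora EPEL", t0)

print(queue.due(t0 + timedelta(hours=1)))  # [] (still waiting)
print(queue.due(t0 + timedelta(hours=2)))  # ['Fedora EPEL']
```

A crawler loop would call `due()` periodically and start a category crawl for each returned entry.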
Adrian
_______________________________________________ infrastructure mailing list infrastructure@lists.fedoraproject.org http://lists.fedoraproject.org/postorius/infrastructure@lists.fedoraproject....
On Fri, Sep 11, 2015 at 04:56:41PM +0200, Adrian Reber wrote: [...]
So my main question is if we should insert a delay between umdl and the crawl of the mirrors? This would require a fedmsg emitted at the end of an umdl run and something on the crawler which waits some time before starting the crawls.
Thinking more about it, it actually does not make much sense to base the mirror crawls on fedmsg. The mirrors are updated at (from our point of view) random times. So with category based crawling we have the possibility to increase the crawl frequency for Fedora Linux and Fedora EPEL and decrease it for Fedora Archive. Which should hopefully give MirrorManager a better view of the status of the mirrors.
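A timer-based alternative with different crawl frequencies per category might look like the sketch below. The interval values are made up for illustration, and `CRAWL_INTERVAL` and `categories_due` are hypothetical names, not part of MirrorManager.

```python
from datetime import datetime, timedelta

# Assumed intervals: frequently changing categories are crawled more
# often, the archive much less often.
CRAWL_INTERVAL = {
    "Fedora Linux": timedelta(hours=6),
    "Fedora EPEL": timedelta(hours=6),
    "Fedora Archive": timedelta(days=2),
}


def categories_due(last_crawl, now):
    """Return the categories whose crawl interval has elapsed.

    `last_crawl` maps a category to the time of its last crawl;
    categories that were never crawled are always due.
    """
    due = []
    for category, interval in CRAWL_INTERVAL.items():
        last = last_crawl.get(category)
        if last is None or now - last >= interval:
            due.append(category)
    return due


now = datetime(2015, 9, 12, 12, 0)
last_crawl = {
    "Fedora Linux": now - timedelta(hours=7),
    "Fedora EPEL": now - timedelta(hours=1),
    "Fedora Archive": now - timedelta(days=1),
}
print(categories_due(last_crawl, now))  # ['Fedora Linux']
```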
Adrian
On Fri, 11 Sep 2015 20:42:23 +0200 Adrian Reber <adrian@lisas.de> wrote:
On Fri, Sep 11, 2015 at 04:56:41PM +0200, Adrian Reber wrote: [...]
So my main question is if we should insert a delay between umdl and the crawl of the mirrors? This would require a fedmsg emitted at the end of an umdl run and something on the crawler which waits some time before starting the crawls.
Thinking more about it, it actually does not make much sense to base the mirror crawls on fedmsg. The mirrors are updated at (from our point of view) random times. So with category based crawling we have the possibility to increase the crawl frequency for Fedora Linux and Fedora EPEL and decrease it for Fedora Archive. Which should hopefully give MirrorManager a better view of the status of the mirrors.
Well, mirrors that are using your script to trigger syncs after a fedmsg would be syncing right after that as well, but might depend on how long it takes them to sync.
kevin
On Sat, Sep 12, 2015 at 12:52:55PM -0600, Kevin Fenzi wrote:
On Fri, Sep 11, 2015 at 04:56:41PM +0200, Adrian Reber wrote: [...]
So my main question is if we should insert a delay between umdl and the crawl of the mirrors? This would require a fedmsg emitted at the end of an umdl run and something on the crawler which waits some time before starting the crawls.
Thinking more about it, it actually does not make much sense to base the mirror crawls on fedmsg. The mirrors are updated at (from our point of view) random times. So with category based crawling we have the possibility to increase the crawl frequency for Fedora Linux and Fedora EPEL and decrease it for Fedora Archive. Which should hopefully give MirrorManager a better view of the status of the mirrors.
Well, mirrors that are using your script to trigger syncs after a fedmsg would be syncing right after that as well, but might depend on how long it takes them to sync.
Yes, my mirror syncs from ::fedora-buffet0/ and that takes a few hours.
Adrian