On Mon, 6 Apr 2020 at 08:58, Timothée Floure timothee.floure@posteo.net wrote:
Hi,
How does the indexing work?
You point Yacy to a domain or list of URLs (https://fnux.fedorapeople.org/pkgs/ in this case), and it takes care of everything. There is also an advanced crawler panel in the UI allowing you to filter content (e.g. HTML classes) from pages, which would be useful if we do not want to index everything (e.g. dependencies).
I am not familiar with the maths used by Yacy for indexing.
Me neither :D
And what would it take to add more info for each package?
I wrote a quick script (https://paste.gnugen.ch/raw/4JAC) that fetches package metadata from PDC+mdapi for testing, but it is way too slow to scale to the whole package set.
Cool. Yeah, the current indexing takes hours (around 4-5 hours, I think), since there are more than 80,000 packages and sub-packages. I think we can run this once a day, so speed is not super critical.
MDAPI will have to be replaced by a local SQLite database to increase performance. I think we could generate most of the content from the repositories' metadata (last N Fedora + EPEL), but I need to find where the SQLite files live. A privileged endpoint on dist-git to fetch the package -> maintainer mapping, bypassing pagination, would be convenient.
You can look at how mdapi grabs these SQLite files (https://pagure.io/mdapi/blob/master/f/mdapi-get_repo_md). For the maintainer mapping, you should be able to find that here: https://src.fedoraproject.org/extras/
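To make the local SQLite idea a bit more concrete, here is a minimal sketch of reading package metadata straight from a database file instead of calling mdapi once per package. The `packages` table and its `name`, `summary` and `url` columns are my assumption about what the repository metadata database looks like (check the real files first), and `demo_connection` is a hypothetical stand-in that builds a tiny in-memory database so the snippet runs on its own.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory database standing in for a repo metadata file.

    The schema is an assumption for illustration, not the verified layout
    of the files mdapi downloads.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE packages ("
        "pkgKey INTEGER PRIMARY KEY, name TEXT, summary TEXT, url TEXT)"
    )
    conn.executemany(
        "INSERT INTO packages (name, summary, url) VALUES (?, ?, ?)",
        [
            ("zsh", "Powerful interactive shell", "http://zsh.sourceforge.net/"),
            ("bash", "The GNU Bourne Again shell", "https://www.gnu.org/software/bash"),
        ],
    )
    return conn


def load_packages(conn):
    """Return every package as a dict, sorted by name, using a single query."""
    rows = conn.execute("SELECT name, summary, url FROM packages ORDER BY name")
    return [{"name": n, "summary": s, "url": u} for n, s, u in rows]


if __name__ == "__main__":
    for pkg in load_packages(demo_connection()):
        print(pkg["name"], "-", pkg["summary"])
```

With a real file, `sqlite3.connect("path/to/file.sqlite")` would replace `demo_connection()`. One query per repository should be far cheaper than one HTTP request per package.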
We can use Yacy's JSON API to build a sexy Fedora-branded search page, but I think it's a late-stage optimization.
+1, would be interesting to see with msuchy if that can be easily integrated with the work he was doing.
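For the search page, a rough sketch of what talking to the JSON API could look like is below. I am assuming the endpoint is `yacysearch.json` with `query` and `maximumRecords` parameters, and that the response nests results under `channels` and `items`; both should be checked against a running instance. `SAMPLE_RESPONSE` is made-up data for illustration, not real output.

```python
import json
from urllib.parse import urlencode

# Made-up payload in the shape I assume the JSON API returns.
SAMPLE_RESPONSE = json.dumps({
    "channels": [{
        "items": [
            {"title": "bash", "link": "https://fnux.fedorapeople.org/pkgs/bash"},
            {"title": "zsh", "link": "https://fnux.fedorapeople.org/pkgs/zsh"},
        ]
    }]
})


def build_search_url(base_url, query, max_records=10):
    """Build the search URL for an instance (endpoint and parameter names assumed)."""
    params = urlencode({"query": query, "maximumRecords": max_records})
    return f"{base_url.rstrip('/')}/yacysearch.json?{params}"


def extract_results(payload):
    """Flatten a JSON response into a list of (title, link) tuples."""
    data = json.loads(payload)
    results = []
    for channel in data.get("channels", []):
        for item in channel.get("items", []):
            results.append((item.get("title", ""), item.get("link", "")))
    return results


if __name__ == "__main__":
    print(build_search_url("http://localhost:8090", "shell"))
    for title, link in extract_results(SAMPLE_RESPONSE):
        print(title, link)
```

A branded front end would fetch the URL from `build_search_url` and render the tuples from `extract_results` in its own template, so the stock UI never has to be exposed.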
-- Timothée