For some time, we've wanted to explore search engine options, so that you
could go to one site and search all of the Fedora sites that we run.
Examples include: Wiki (wiki search is not that great), docs.fp.o, pkgdb,
etc.
Relevant information from the last time this was discussed:
https://fedorahosted.org/fedora-infrastructure/ticket/1055
http://fedoraproject.org/wiki/Infrastructure/Search
I've been playing with various options on one of our junkXX boxes, and
seeing what works well.
- I tried Sphinx, but it seems this is really just a database fulltext
search, not a full-out search engine and crawler solution.
- I tried Xapian, but getting it to crawl required a lot of hacking and
conversion from an external crawler (e.g. htdig), and htdig kept throwing
traces and dying on https sites.
- I tried mnoGoSearch, but its CGI would not work at all. It would simply
time out when I tried to go to it.
- Lastly, I tried Datapark Search, which seems like our best bet:
  - I ran into an issue where the crawler would randomly throw traces
    about libcrypto. I reported the issue upstream and they put out a
    snapshot release two days later that seems to have fixed the issue. So
    upstream is active.
  - I played with some styling ideas, and tried to incorporate search
    results into the standard Fedorahosted/people/wiki template. It needs
    some work to finish, but it's getting there.
  - The default CGI template had horrid HTML, but I worked with that and
    got it reasonable (I'm going to finish it up today or tomorrow and try
    to get it passing as valid HTML5).
Out of the options I tried, Datapark Search seems like the best one
available. It is a fork of mnoGoSearch, and it has a lot of options to
customize it and shape it into what we want it to do.
That said, I am more than open to trying other options before we decide to
move forward with Datapark. If nobody screams over the next few days, I
will work on moving forward. We need to package it, and it looks like we'll
have to package the snapshot version.
Anyway, I am just throwing this out to update everyone on my findings and
to see if anyone has ideas for other options.
-re