In answer to your questions about the Equivalent-Packages process:
 
1) You are right that the tool can get confused when a package contains little source, or when most of its files are common ones such as README, TODO or Makefile. One thing I could do is exclude source files which are excessively common. I do that in other tools where I use package similarity, but I tried to keep this package-equivalence tool as simple as possible. It is not difficult to implement and it should reduce some of the false positives.
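
To make the idea concrete, here is a minimal sketch of that filtering, not the tool's actual code. It assumes each package is represented as a set of file names and that similarity is a Jaccard index over those sets; the function names and the max_fraction cutoff are made up for illustration.

```python
from collections import Counter

def common_files(packages, max_fraction=0.05):
    """Return file names that occur in more than max_fraction of packages.

    packages maps a package name to an iterable of its file names.
    max_fraction is a hypothetical cutoff; it would need tuning.
    """
    counts = Counter()
    for files in packages.values():
        counts.update(set(files))  # count each name once per package
    limit = max_fraction * len(packages)
    return {name for name, n in counts.items() if n > limit}

def similarity(a, b, ignored=frozenset()):
    """Jaccard similarity of two file sets after dropping ignored names."""
    a, b = set(a) - ignored, set(b) - ignored
    if not a or not b:
        return 0.0  # nothing left to compare, so treat as no match
    return len(a & b) / len(a | b)
```

With the common names removed, two packages that share only README and Makefile no longer look similar, and a package left with no files at all simply matches nothing.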
 
2) When there are multiple possible matches, I simply choose the package with the highest similarity. One thing I also did was run the tool against a single repo (instead of, say, the Fedora repo against the Debian repo) to find near-duplicate packages. So far I have done this only for Debian: https://github.com/silviocesare/Equivalent-Packages/blob/master/Clusters/Debian5Cluster
 
The first entry in the list is the base package; the remaining entries are the near-duplicate packages and their similarities to the base package. An example from the Debian repo:
 
libxml-um-perl libapp-control-perl:0.846154 libcrypt-des-ede3-perl:0.846154 libdata-buffer-perl:0.846154 libdb-file-lock-perl:0.846154 libio-tee-perl:0.846154 liblingua-preferred-perl:0.846154 liblingua-pt-stemmer-perl:0.846154 liblog-tracemessages-perl:0.846154 libpdf-reuse-barcode-perl:0.846154 libsort-fields-perl:0.846154 libtemplate-plugin-calendar-simple-perl:0.846154 libxml-filter-detectws-perl:0.846154 libxml-filter-saxt-perl:0.846154 libxml-handler-printevents-perl:0.846154 libxml-handler-trees-perl:0.846154 libxml-regexp-perl:0.846154
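
For anyone who wants to process the cluster file, a line in the format above can be split apart with something like the following. This is only a sketch based on the example line; parse_cluster_line is a made-up helper, and it assumes entries are whitespace-separated with the similarity after the last colon.

```python
def parse_cluster_line(line):
    """Split 'base pkg:sim pkg:sim ...' into (base, [(pkg, sim), ...]).

    Assumes a non-empty line; callers should skip blank lines.
    """
    base, *entries = line.split()
    duplicates = []
    for entry in entries:
        # Split on the last colon so the whole package name is kept.
        name, _, sim = entry.rpartition(":")
        duplicates.append((name, float(sim)))
    return base, duplicates

line = "libxml-um-perl libapp-control-perl:0.846154 libcrypt-des-ede3-perl:0.846154"
base, duplicates = parse_cluster_line(line)
```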

One possible way to reduce false positives is to ignore packages which are equivalent to more than one other package, or perhaps to flag those cases for human intervention.

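
A rough sketch of how that could look, combined with the highest-similarity rule. This is an illustration rather than what the tool does today; pick_match and the 0.8 threshold are assumptions.

```python
def pick_match(candidates, threshold=0.8):
    """Return (package, status) for a dict of candidate -> similarity.

    status is "none" when nothing reaches the threshold, "ok" when exactly
    one candidate does, and "review" when several do, in which case the
    highest-scoring one is returned but a human should decide.
    The threshold value is hypothetical.
    """
    good = {pkg: sim for pkg, sim in candidates.items() if sim >= threshold}
    if not good:
        return None, "none"
    best = max(good, key=good.get)
    return best, ("ok" if len(good) == 1 else "review")
```

A case like foo versus mingw32-foo would come back as "review" instead of silently picking one of the two.
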
3) I did some light testing of unexpected matches. One thing I looked at was cases where the same package name existed in both Fedora and Debian but the similarity was so low that the packages did not match. Surprisingly, a non-trivial number of packages were like this, and manual verification of the ones I looked at showed that they were in fact different packages. This demonstrates that if you base equivalence on names only, you will get false positives.
 
I could add a heuristic based on the package name to request human intervention, i.e. if two packages are found to be similar but their names do not have 50% overlap, then request human verification. I am not sure how useful this would be because, from experience, package names can sometimes be problematic.
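
As an illustration of the name heuristic, here is a small sketch. I have not settled on how "overlap" would be measured, so this assumes the ratio from Python's difflib.SequenceMatcher as a stand-in; needs_review is a made-up name.

```python
from difflib import SequenceMatcher

def needs_review(name_a, name_b, min_overlap=0.5):
    """Return True if two matched packages have names that overlap too little.

    Overlap is taken to be SequenceMatcher.ratio(), which is an assumption;
    min_overlap=0.5 corresponds to the 50% figure mentioned above.
    """
    overlap = SequenceMatcher(None, name_a, name_b).ratio()
    return overlap < min_overlap
```

Packages with identical names would never be flagged, while a content match between two unrelated names such as arptools and binclock would be.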
 
--
Silvio
 
On Tue, Feb 1, 2011 at 1:13 AM, Tomas Hoger <thoger@redhat.com> wrote:
Hi Silvio!

On Mon, 31 Jan 2011 19:21:39 +1100 Silvio Cesare wrote:

> Debian maintain a list of CPE information for packages on their
> security tracker
> http://svn.debian.org/wsvn/secure-testing/data/CPE/list

We currently do not use CPE names for security tracking in Fedora, so
I don't see an obvious benefit maintaining such list.  Can you explain
briefly how you use it for Debian security tracking and what benefits
it brings?

> This makes it relatively static except when packages are added or
> removed from the repository.

It's not that uncommon to see new packages added to Fedora repositories
even after the release of some Fedora version.

> In the past I generated an automatic mapping between packages in
> Debian and Fedora
> https://github.com/silviocesare/Equivalent-Packages/blob/master/NearestNeighbour/Debian5_Fedora13_Matches

I played a little more with this list and noticed a few problems:
- quite a few Debian packages map to Fedora arptools or binclock.
 Probably packages with not much source, where other files (license,
 configure) confuse your tool into matching unrelated packages
- there does not seem to be a good way to list cases where multiple
 components contain the same sources.  In Fedora, mingw32-* packages
 are a good example, and the list often maps Debian package foo to
 Fedora package mingw32-foo, while there is a Fedora package foo that
 should be a similarly good match.  Another example is
 zlib:arm-gp2x-linux-zlib.

Did you review "unexpected matches" to see if the sources are really
similar, and how the match is picked when there are multiple "good
candidates"?

--
Tomas Hoger / Red Hat Security Response Team