It also ensures that there is a subset of domains that's actually feasible to crawl in a reasonable time frame. I then added some logic to rank matches in domains/URLs and titles higher than matches in page content, penalize smaller matching documents, and reward longer ones (to offset the bias that's in BM25). I didn't want to overthink this, so I implemented BM25 ranking for the main rank calculation.
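
To make the ranking idea concrete, here is a rough sketch of how a BM25 base score could be combined with URL/title boosts and a length reward. This is not the actual implementation: the constants (`K1`, `B`, `URL_BOOST`, `TITLE_BOOST`, `LENGTH_WEIGHT`), the page dictionary layout, and the way the pieces are combined are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Standard BM25 parameters (common defaults, not taken from the post).
K1 = 1.5
B = 0.75
# Hypothetical weights: the post does not say how much each signal counts.
URL_BOOST = 3.0
TITLE_BOOST = 2.0
LENGTH_WEIGHT = 0.1


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def bm25_scores(query, docs):
    """Score each tokenized document in `docs` against the query tokens."""
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n or 1.0
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: this is the part that favors short documents.
            norm = tf[term] + K1 * (1 - B + B * len(d) / avgdl)
            score += idf * tf[term] * (K1 + 1) / norm
        scores.append(score)
    return scores


def rank(query_text, pages):
    """Return (score, url) pairs, best first.

    `pages` is assumed to be a list of dicts with "url", "title", "content".
    """
    query = tokenize(query_text)
    bodies = [tokenize(p["content"]) for p in pages]
    results = []
    for page, body, base in zip(pages, bodies, bm25_scores(query, bodies)):
        url_tokens = set(tokenize(page["url"]))
        title_tokens = set(tokenize(page["title"]))
        url_hits = sum(t in url_tokens for t in query)
        title_hits = sum(t in title_tokens for t in query)
        # Matches in the URL and title count for more than matches in the body.
        boosted = base + URL_BOOST * url_hits + TITLE_BOOST * title_hits
        # Mild reward for longer documents to offset BM25's length normalization.
        length_factor = 1 + LENGTH_WEIGHT * math.log(1 + len(body))
        results.append((boosted * length_factor, page["url"]))
    return sorted(results, reverse=True)
```

Calling `rank("pasta", pages)` returns the pages sorted by the combined score. Adding the boosts to the base score, rather than multiplying, means a page whose title matches still gets a non-zero score even if the term never appears in its body.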