Answers from Aaron
A couple of days ago, Aaron Swartz posted an entry on his Google Weblog asking for questions about Google. I sent him some I had about how Google indexes pages as it crawls the net.
He answered tonight that the threshold I had suspected (the one that decides which pages go into the main index) is simply PageRank. Obviously! Why have another measurement for pages? I didn’t think of that.
He also wrote that crawling pages every day is something Google began doing only recently, and that he suspects the “Similar pages” links use “their linking index, which likely is only part of the permanent crawl (since they don’t care about temporary pages that much)”.