How does Googlebot determine a document's content type?
How does Googlebot learn the content type of a document, e.g. web pages (HTML), PDFs, Excel documents, etc.?
Here is what I have found so far:
Googlebot does not make a HEAD request to every URL. Without a HEAD request, how can it know the content type, and therefore how to parse the document and understand what it is about?
According to this article, Googlebot uses a Chromium browser to load every URL. Loading a huge PDF that way would be too costly (time, memory, etc.). Fetching it with wget/curl and parsing it directly would be much simpler. So how does Google handle this at scale?
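To make the wget/curl idea concrete, here is a minimal Python sketch of what I have in mind. It relies on the fact that a GET response carries the same Content-Type header a HEAD response would, so the type can be read before the body is downloaded. The PARSERS table, the function names, and the max_bytes cap are my own assumptions for illustration; I am not claiming this is how Googlebot actually works.

```python
import urllib.request

# Hypothetical mapping from media type to a parser name (illustration only,
# not Google's actual pipeline).
PARSERS = {
    "text/html": "html",
    "application/pdf": "pdf",
    "application/vnd.ms-excel": "excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "excel",
}


def media_type(content_type_header):
    """Drop parameters such as '; charset=utf-8' and normalise case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def choose_parser(content_type_header):
    """Pick a parser name from a raw Content-Type header value."""
    return PARSERS.get(media_type(content_type_header), "unknown")


def fetch_and_classify(url, max_bytes=1_000_000):
    """Issue a single GET, read Content-Type from the response headers,
    then read at most max_bytes of the body (an assumed safety cap)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        header = resp.headers.get("Content-Type", "")
        parser = choose_parser(header)
        body = resp.read(max_bytes)
    return parser, body
```

With something like this, a huge PDF would be identified from the headers alone and could be truncated or handed to a dedicated PDF parser without ever being loaded in a browser.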