Log File Analysis FAQs

Which user agents can I configure Conductor Monitoring to detect in my log files?

You can detect the following user agents with Conductor Monitoring:

Google

Google uses a single user agent, Googlebot, for both its desktop and mobile crawling, though the full user-agent string differs. Its crawling is based on a mobile-first approach.

Full user-agent strings:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
(Note: The Chrome version (W.X.Y.Z) is regularly updated to reflect a recent version of the browser.)
Published IP addresses: https://developers.google.com/search/apis/ipranges/googlebot.json

Bing

Bing uses a single user agent, Bingbot, for both its desktop and mobile crawling, though the full user-agent string differs. Its crawling is based on a mobile-first approach.

Full user-agent strings:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/W.X.Y.Z Safari/537.36
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Published IP addresses: https://www.bing.com/toolbox/bingbot.json

OpenAI - GPTBot

GPTBot is used to crawl content that may be used in training OpenAI's generative AI foundation models. Disallowing GPTBot indicates a site’s content should not be used in training generative AI foundation models.

Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot
Published IP addresses: https://openai.com/gptbot.json

OpenAI - search

OAI-SearchBot is used to link to and surface websites in search results in ChatGPT's search features. It is not used to crawl content to train OpenAI’s generative AI foundation models. To help ensure your site appears in search results, OpenAI recommends allowing OAI-SearchBot in your site’s robots.txt file and allowing requests from their published IP ranges below.

Full user-agent string will contain: OAI-SearchBot/1.0; +https://openai.com/searchbot
Published IP addresses: https://openai.com/searchbot.json

OpenAI - user

When users ask ChatGPT or a custom GPT a question, it may visit a web page using the ChatGPT-User agent. ChatGPT users may also interact with external applications via GPT Actions. ChatGPT-User is not used to crawl the web automatically, nor to crawl content for generative AI training.

Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
Published IP addresses: https://openai.com/chatgpt-user.json

Perplexity - PerplexityBot

PerplexityBot is designed to surface and link websites in search results on Perplexity. It is not used to crawl content for AI foundation models. To ensure your site appears in search results, Perplexity recommends allowing PerplexityBot in your site’s robots.txt file and permitting requests from their published IP ranges listed below.

Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
Published IP addresses: https://www.perplexity.com/perplexitybot.json

Perplexity - user

Perplexity-User supports user actions within Perplexity. When users ask Perplexity a question, it might visit a web page to help provide an accurate answer and include a link to the page in its response. Perplexity-User controls which sites these user requests can access. It is not used for web crawling or to collect content for training AI foundation models. (Since a user requested the fetch, this fetcher generally ignores robots.txt rules.)

Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)
Published IP addresses: https://www.perplexity.com/perplexity-user.json

How does Conductor ensure that log ingestion is anonymous and that no sensitive info or PII is being passed?

Cloudflare Worker

Logs generated by the Cloudflare Worker contain no PII from the moment they are created. The Worker is a script that runs automatically and records only visits from search engines.

Once the Cloudflare Worker is installed, your team can inspect and audit its code in real time in Cloudflare Workers, although the code is minified.

Akamai DataStream, CloudFront Standard Logging, Fastly Real-Time Log Streaming, Cloudflare Logpush

Akamai DataStream, CloudFront Standard logging, Fastly Real-Time Log Streaming, and Cloudflare Logpush send logs to an AWS S3 bucket that Conductor Monitoring provisions when you enable the Log File Analysis feature. To ensure that only search engine traffic is sent to Conductor Monitoring, two levels of filtering are in place.

The first level filters out all non-search-engine traffic in the AWS S3 bucket, so only data free of personally identifiable information (PII) is sent to Conductor Monitoring servers. This is done by matching each request's IP address against the search engines' published IP addresses.
The second level is a precaution that runs on Conductor Monitoring servers: Conductor Monitoring checks the IP addresses again and discards any data that does not come from a search engine.
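
The IP-based check used at both levels can be sketched in Python roughly as follows. This is an illustration only, not Conductor's implementation. The `prefixes` / `ipv4Prefix` / `ipv6Prefix` layout follows the format Google publishes in googlebot.json, and other providers' files are assumed to be similar. The sample prefixes are reserved documentation ranges rather than real crawler ranges, and `load_networks`, `is_search_engine_ip`, and the log entries are hypothetical.

```python
import ipaddress

# Sample data shaped like Google's published googlebot.json.
# The prefixes are reserved documentation ranges, NOT real crawler ranges.
SAMPLE_RANGES = {
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/24"},
        {"ipv6Prefix": "2001:db8::/32"},
    ]
}

def load_networks(ranges):
    """Turn a published-ranges document into a list of network objects."""
    networks = []
    for entry in ranges.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks

def is_search_engine_ip(client_ip, networks):
    """Return True if client_ip falls inside any published network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed IPs are treated as non-search-engine traffic
    return any(address in network for network in networks)

NETWORKS = load_networks(SAMPLE_RANGES)

# Hypothetical log entries: only the first one is inside a sample range.
log_lines = [
    {"ip": "192.0.2.17", "path": "/products"},
    {"ip": "198.51.100.5", "path": "/checkout"},
]
kept = [line for line in log_lines if is_search_engine_ip(line["ip"], NETWORKS)]
```

Running the same check twice, once at the bucket and once on the server, means an entry is processed only if it passes both times.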

Furthermore, in your Akamai and Fastly accounts, you can configure Akamai DataStream and Fastly Real-Time Log Streaming to send only search engine traffic to the AWS S3 bucket. We require doing so, and you can learn how to filter out non-search-engine traffic in the linked articles.

How is the Cloudflare Worker invoked?

The Cloudflare Worker acts as middleware between the visitor (a search engine) and your website. It intercepts only requests made by the selected search engine bots and sends a message to the Conductor Monitoring backend. This is done in a non-blocking way, so it cannot affect the response the visitor receives.

To connect Conductor Monitoring to Cloudflare, you will need to create an API token in your Cloudflare account. Using this token, Conductor Monitoring will automatically install its Cloudflare Worker for this website without any additional configuration required from your side.

Afterwards, the Cloudflare Worker will continuously detect and track visits of search engines to this website. The Cloudflare API token is deleted from our systems immediately after the Cloudflare Worker is installed.

At what execution interval does the Cloudflare Worker run?

It runs on every HTTP request made to your website, but it stops doing any further work as soon as it determines that the visitor is not one of the supported search engines.
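
The control flow described above (run on every request, exit early for non-bot traffic, report bot visits without delaying the response) can be simulated in Python as shown below. The real Worker runs inside Cloudflare and its code is not shown in this article, so treat this purely as a sketch: `report_visit`, `fetch_origin`, `handle_request`, and the `reported` list are hypothetical stand-ins.

```python
import asyncio

BOT_TOKENS = ("Googlebot", "bingbot", "GPTBot", "OAI-SearchBot",
              "ChatGPT-User", "PerplexityBot", "Perplexity-User")

reported = []  # hypothetical stand-in for the Conductor Monitoring backend

async def report_visit(url, user_agent):
    """Hypothetical reporting call; a real one would be an HTTP request."""
    await asyncio.sleep(0)  # yield control, as network I/O would
    reported.append((url, user_agent))

async def fetch_origin(url):
    """Hypothetical stand-in for forwarding the request to the website."""
    return f"response for {url}"

async def handle_request(url, user_agent, background_tasks):
    # Early exit: non-bot traffic is passed through and nothing is recorded.
    if not any(token in user_agent for token in BOT_TOKENS):
        return await fetch_origin(url)
    # Schedule reporting without awaiting it, so the response is not delayed.
    background_tasks.append(asyncio.create_task(report_visit(url, user_agent)))
    return await fetch_origin(url)

async def main():
    tasks = []
    human = await handle_request(
        "/pricing", "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0", tasks)
    bot = await handle_request(
        "/pricing",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        tasks)
    await asyncio.gather(*tasks)  # let background reporting finish
    return human, bot

human_response, bot_response = asyncio.run(main())
```

In this simulation both visitors receive the same response, and only the Googlebot request ends up in `reported`.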

Important

Conductor Monitoring never tracks visits from human visitors to your websites, and this data never reaches our systems.

Why don't I see one or more of the supported user agents in my reports?

Depending on your log source provider, you may need to configure criteria or update filters to include each of our supported user agents in your log files. If this is the case, be sure to include all of the following user agents:

bingbot
Googlebot
OAI-SearchBot
ChatGPT-User
GPTBot
PerplexityBot
Perplexity-User

Note that these strings are case-sensitive, so be sure to use uppercase and lowercase letters exactly as shown above.
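
To illustrate why case matters, here is a minimal Python sketch of a case-sensitive substring filter built from the tokens listed above. The `detect_bot` function is a hypothetical helper for illustration; your log source provider will have its own filter syntax.

```python
# Tokens copied from the list above; matching is case-sensitive.
SUPPORTED_TOKENS = [
    "bingbot",
    "Googlebot",
    "OAI-SearchBot",
    "ChatGPT-User",
    "GPTBot",
    "PerplexityBot",
    "Perplexity-User",
]

def detect_bot(user_agent):
    """Return the first supported token found in the user-agent string, or None."""
    for token in SUPPORTED_TOKENS:
        if token in user_agent:
            return token
    return None

google = detect_bot(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
wrong_case = detect_bot("Mozilla/5.0 (compatible; googlebot/2.1)")
bing = detect_bot(
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)")
```

Here `google` is "Googlebot" and `bing` is "bingbot", while `wrong_case` is None because "googlebot" does not match the capitalized token.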

Can I integrate log file data from my data aggregation platform?

No, Conductor Monitoring does not support integrations with data aggregation platforms, and supports only integrations with the following CDNs:

Cloudflare Worker
Cloudflare Logpush
Akamai DataStream
CloudFront Standard logging
Fastly Real-Time Log Streaming