Most of our readers probably know Google Dorks. Some may have also searched through data in public S3 buckets or implemented a scraper for Pastebin. But are you familiar with urlscan.io, and what kind of data you can find there that is arguably even more private?
-- MARKDOWN --
# Table of Contents
- [Introduction: Github data leak](#introduction-github-data-leak)
- [What is urlscan.io?](#what-is-urlscan.io-)
- [Lots of data](#lots-of-data)
- [Lots of integrations](#lots-of-integrations)
- [What sensitive data can be mined?](#what-sensitive-data-can-be-mined-)
- [urlscan.io dorks](#urlscan-io-dorks)
- [Apple seems to have disappeared?](#apple-seems-to-have-disappeared-)
- [Where does the data come from?](#where-does-the-data-come-from-)
- [Contacting users](#contacting-users)
- [Contacting urlscan.io](#contacting-urlscan-io)
- [How exactly did the data end up there?](#so-how-exactly-did-the-data-end-up-there-)
- [Multi-step account takeover](#multi-step-account-takeover)
-- /MARKDOWN --
In February this year, GitHub sent an email to affected customers notifying them of a data breach. Specifically, users that enabled hosting via Github Pages for a private repository had the repository name leaked (together with their username).
There seems to be no public acknowledgement of this breach, and I was only made aware through a Hacker News post, where the full email was posted.
So, how did that happen?
From the mail:
GitHub learned from an internal discovery by a GitHub employee, that GitHub Pages sites published from private repositories on GitHub were being sent to urlscan.io for metadata analysis as part of an automated process
GitHub responded to the breach by “fixing the automated process that sends GitHub Pages sites for metadata analysis so that only public GitHub Pages sites are sent for analysis” as well as asking the 3rd-party to delete the data.
Seeing that GitHub can make the mistake of accidentally listing their internal urlscan.io scans publicly, I thought the site could contain potential for more data leaks.
As I hadn’t heard of the site before, I decided to check it out.
urlscan.io describes itself as “a sandbox for the web”, where you can submit URLs which are then analyzed and scanned in various ways, mainly to detect malicious websites such as phishing sites. Besides analyzing the URLs submitted via the website, urlscan.io also scans URLs from public data sources and provides an API to integrate the check into other products. This last option is the one that lead to the systematic data leak of private repository URLs by GitHub.
At the time of writing, the landing page listed 124k public, 76k unlisted and 436k private scans performed within the last 24 hours.
It also includes one of those “recent scan” views that is typical for that kind of security scanning sites (e.g. compare with https://www.ssllabs.com/ssltest/).
What is however more surprising is the option to search through all historical data (as an unauthenticated user) using the extensive ElasticSearch Query String syntax. This was also mentioned in GitHub’s notification mail:
To view the name of the private repository on urlscan.io, you would need to have been looking at the front page of urlscan.io within approximately 30 seconds of the analysis being performed or have specifically searched using a query that would return the analysis in the search results.
For every scan result, the service provides a lot of information:
urlscan.io’s docs page lists 26 commercial security solutions by vendors such as Palo Alto, Splunk, Rapid7, FireEye and ArcSight that have integrated the service via its API. GitHub, which is directly using this API internally as part of their SaaS offering, is missing from this list however, as are probably many more enterprise customers.
If any of those tools/API users are accidentally performing public URL scans, this could lead to systematic data leakage. As those advanced security tools are mostly installed in large corporations and government organizations, leaked information could be particularly sensitive.
Besides commercial products, the integration page also lists 22 open-source projects, some of which are information gathering tools, and others are simple library implementations for easier querying of the API.
With the type of integration of this API (for example via a security tool that scans every incoming email and performs a urlscan on all links), and the amount of data in the database, there is a wide variety of sensitive data that can be searched for and retrieved by an anonymous user.
Please find below a collection of clickable “urlscan.io dorks” and (redacted) example results (please note that after reporting our findings to urlscan.io, they have added deletion rules for many of the below dorks so you might need to get a bit creative with the queries yourself):
Interestingly, when I performed my initial search back in February, I could find a lot of juicy URLs for Apple domains:
It seems like this information has in the meantime been hidden or deleted from the database:
However, when continuously monitoring the above result page, sometimes some fresh additional entries can be spotted, which disappear again within around 10 minutes.
We later found out that Apple has in the meantime requested an exclusion of their domains from the scan results, which is implemented via periodically deleting all scan results matching certain rules.
Overall, the urlscan.io service contains a trove of sensitive information of various kinds, that can be used by hackers, spammers, or cyber criminals for example to take over accounts, perform identity theft, or run believable phishing campaigns.
We can see from the details of a scan result whether a scan was submitted via the API, but we can not find out which application or integration submitted the scan request.
For this, we had two options:
We decided to do both and started contacting users while also collecting a list of scan results to send to urlscan.io.
We sent 23 mails to individuals whose email addresses were leaked in API-started scans (15 were from unsubscribe links, 5 from PayPal invoices, 2 from password reset links, and 1 from a PayPal claim).
At the end of the notification mails were "bait links" with unique UUIDs to test whether any URLs will be auto-submitted to urlscan.io. If that was the case, the unique token would allow us to associate the scan request back to a mail recipient.
Out of the 23 test links that we sent out, 9 (~40%) were submitted via the API to urlscan.io (most of them immediately after the corresponding mail was sent):
Without counting PayPal invoices (which might have actually been created as part of a scam campaign), the success rate is 9 out of 18 (50%).
The next day we sent out another 24 mails to email addresses from leaked hubspot unsubscribe links with 12 of them (50% again) triggering a public scan.
Those "pingbacks" show us several things:
For organizations where we found multiple email addresses or systematic leakage, we also tried contacting the IT/security department directly.
Unfortunately, neither did we receive any answer on any of the data leak notification mails to affected individuals nor could we get any feedback from organizations that seem to have a systematic problem.
With one exception: After sending one person a DocuSign link to their work contract, their employer reached out to us, started an investigation and awarded a bug bounty. They found the source of the leak to be "a misconfiguration of [their] Security Orchestration, Automation and Response playbook for integration with urlscan.io, which was in active development".
Without much feedback from affected users, we also explained the situation to urlscan.io.
In case the integration developers are following urlscan.io's docs (at least with regard to the following section), it should be quite easy for the urlscan.io team to identify the source software of the scan requests:
Integrations: Use a custom HTTP user-agent string for your library/integration. Include a software version if applicable.
We therefore asked, whether it would be possible for them to generate a list of user agents that triggered the scans related to the people we contacted, as well as for the scans of our own bait links, and if they would share that list with us - which they did!
In general, the urlscan.io team was very responsive and offered to investigate and work together with us to improve the current situation.
Reviewing the list of user agents revealed that many API integrations do not follow the above recommendation with more than half of the scans having been started with a generic "python-requests/2.X.Y" user agent.
The two solutions that could be easily identified via the user agents were:
Further investigations on the API keys used to start scans with "python-requests" user agents revealed that many of them were also generated for PaloAlto's XSOAR (or "Demisto", its former name). Others were generated for:
-- MARKDOWN --
Security Orchestration, Automation and Response (SOAR) platforms allow organizations to write their own playbooks to connect different data sources with security tools and services. To ease development, the platforms offer integrations with several 3rd-party services, e.g. via [this XSOAR urlscan.io pack](https://cortex.marketplace.pan.dev/marketplace/details/UrlScan/). With this pack installed, a playbook could extract URLs from incoming emails and submit them to urlscan with the command `!url url=https://example.com using="urlscan.io"` with optional parameters for e.g. the timeout and scan visibility.
The visibility of such a scan is then dependent on
1. the parameters submitted as part of the command,
2. the integration-wide configuration, and
3. the visibility settings of the associated urlscan.io account/team.
Therefore, a scan can be be wrongfully submitted as public
1. in case of a programming mistake in a playbook or a misconfiguration of the urlscan.io integration or account visibility settings, as has happened to the company that leaked their employee's work contract, or
2. in case the integration itself has a bug that does not respect the user-chosen visibility, as [has happened to the PaloAlto XSOAR urlscan.io pack](https://github.com/demisto/content/pull/18816) (fixed on May 1st of this year):
-- /MARKDOWN --
The argument [ ...] had been configured to have a default value of public, and this argument overrides all other settings related to visibility. [...] Consequently, for all command invocations that did not explicitly provide a value for this newly introduced argument, all scans have since been executed with visibility public.
As a result of our findings, urlscan.io reached out to customers who they identified as submitting a significant amount of public scans and started reviewing popular third-party integrations such as for SOAR tools to ensure they respect the user intent with regards to visibility.
Furthermore, they implemented the following changes:
urlscan.io also published a blog post titled "Scan Visibility Best Practices" that explains scan visibility settings, encourages users to frequently review their submissions and details urlscan's efforts to prevent such leaks.
In case sensitive data is still leaked, they offer the following options:
While passively searching through historical urlscan.io data already uncovers a trove of sensitive information, combining it with "active probing" can greatly increase the impact of any small leak.
As we have seen earlier, for around 50% of the users that have an unsubscribe link leaked in urlscan.io, any link in any incoming email will be immediately submitted to urlscan as a public scan.
We can find those misconfigured clients by scraping urlscan.io for email addresses (e.g. in unsubscribe pages or even just in the URL itself), sending them a "bait link" via mail and subsequently checking urlscan.io for the link.
By actively triggering password resets for the affected email addresses at various web services such as social media sites, other email providers or banks, and then checking for recent scan results for the corresponding domains in the urlscan database, we can exploit this behavior to take over those user's accounts. Searching additional data leaks e.g. using HaveIBeenPwned for those email addresses might also provide hints which services the users are registered at.
For company email addresses, custom login portals and popular enterprise SaaS can be particularly interesting to find existing accounts at and trigger password resets for.
And even if no account exists yet at a specific service, just creating a new one with a corporate email address might provide access to the company's internal data stored on that service.
The following graphic illustrates all the necessary steps to perform the account takeover attack:
As owner of a web service, you can make sure that you’re expiring password reset and similar links quickly and are not leaking unnecessary information to unauthenticated users via links that could become public. On an unsubscribe page, redact the user’s email address, and ask for additional authentication/information before showing PII (like many package tracking websites nowadays ask for a ZIP code before showing the full address). When implementing API authentication, do not accept API keys via GET parameters and instead require the usage of a separate HTTP header.
Furthermore, you can also search urlscan.io as well as other services yourself for any data leaks regarding your own web service or organization, request deletions/exclusions and e.g. disable and rotate any leaked API keys of your users.
urlscan.io users/security teams that integrate the service should review their command, integration and account visibility settings, keep their integrations up to date, regularly review their submitted scans and check the urlscan.io blog post for more information.
We have shown that the service urlscan.io, which usually helps protect users, also stores sensitive information of those users, some of which is publicly available and can be searched through by attackers.
This information could be used by spammers to collect email addresses and other personal information. It could be used by cyber criminals to take over accounts and run believable phishing campaigns. And it could also be used by red teamers and security researchers to find hidden admin portals, gain a first foothold or find potential targets.
Similar to Google Dorking or the search through public S3 buckets, “urlscan.io mining” reveals a lot of information that was not meant to be public. The difference is, that the information that can be found via Google or that is in a public S3 buckets actually is already public, while the URLs submitted to urlscan.io for public scanning might contain authentication tokens and originate from a private email to an individual, submitted by a security solution. It’s a case where the introduction of additional security tooling can actually degrade a system's security.
The docs do warn of personally identifiable information in the submitted URLs, but only suggest to mark those scans as Unlisted (instead of Private):
TAKE CARE to remove PII from URLs or submit these scans as Unlisted, e.g. when there is an email address in the URL.
Those unlisted scans are at least still “visible to vetted security researchers and security companies in [the] urlscan Pro platform”. Furthermore, this warning does not address the risk of PII in the returned page instead of in the URL itself.
The pricing and API quotas also heavily favor the use of public scans and the service does not (effectively) proactively prevent the leakage of PII (e.g. a simple search for page.url:@gmail.com returns the maximum number of 10000 results).
However, the urlscan.io team has been very responsive to our report, supported the investigation, published a blog post to educate their users and implemented software and process changes to reduce the number of leaks.
-- MARKDOWN --
`2022-02-15`: Github notified affected users of the private repository name leak via email
`2022-02-16`: Initial exploration of the urlscan.io service
`2022-07-05`-`2022-07-15`: Continued analysis, scraped email addresses, sent notification mails
`2022-07-15`: Reported findings to urlscan.io and shared blog post draft
`2022-07-15`-`2022-07-20`: Discussed and investigated findings together with urlscan.io
`2022-07-19`: urlscan.io released new version with improved scan visibility UI and team-wide maximum visibility setting
`2022-07-27`: urlscan.io published blog post on [Scan Visibility Best Practices](https://urlscan.io/blog/2022/07/27/scan-visibility-best-practices/)
`2022-11-02`: We are publishing this blog post
-- /MARKDOWN --