Symantec Research Labs (SRL) have discovered that the good intentions of a site like urlscan.io can lead to the exposure of sensitive, private information of individuals, employees, and organizations.
urlscan.io has a publicly available index of security scans they run on all the URLs that have been publicly submitted to their service by users around the world. The intention of the scans is to find and reveal domains that are used by malicious actors. In their About page, urlscan.io describes themselves as "a widely-used tool for security professionals and amateurs to investigate possibly malicious pages, such as phishing attempts or pages impersonating known brands." While the site offers users the possibility to run scans privately, so that reports are only seen by URL submitters, many scans are run in public mode.
Symantec Research Labs conducted a study on 700 Software-as-a-Service (SaaS) domains found in urlscan.io's publicly available index. The study focused on SaaS domains because organizations increasingly rely on SaaS to improve employee productivity and to ensure compliance with local and global data processing regulations.
SRL found that two Fortune 500 organizations (one from the Engineering sector and one from the Technology sector), one government organization, and multiple organizations from the financial sector were affected by exposed data. The types of exposed data ranged from personally identifiable information (PII), such as names, usernames, or email addresses, to confidential documents, such as contracts requiring digital signatures. Upon discovery of this information, the SRL team contacted the urlscan.io team, who promptly removed the reports from the public index.
While some of the exposed data can pose an immediate risk (e.g. a sensitive document leaked via a screenshot), other exposures, such as PII, could lead to future social engineering attacks. PII associated with a specific context (e.g. a person's attendance at an inauguration ceremony, or the purchase of a product or service as shown by an invoice or booking confirmation) is especially vulnerable to such misuse. In some cases, SRL found that following a link triggered an action on behalf of the user, such as accepting an appointment or unsubscribing from a mailing list, in Web applications with limited or non-existent authorization mechanisms.
The 700 SaaS domains reviewed by SRL had at least one URL and associated artefacts (e.g. HTML document, screenshot of the rendered page) made available by the urlscan.io API, but some of these URLs were completely innocuous (e.g. home page). To identify those URLs exposing sensitive information, a semi-automated clustering and review approach was used. While a fully automated process would be ideal, SRL took on the manual part of the process to be sure the data collected in their research was as accurate as possible.
The heuristics used to classify the data allowed SRL to separate, with high precision, URLs associated with sensitive data exposure (e.g. https://mycust.saasexample.com/sign?projectId=4356&user=mc) from innocuous URLs.
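The study does not publish its heuristics, so the following is only an illustrative sketch of what a first-pass URL filter could look like. The keyword lists and the function name `looks_sensitive` are assumptions made for this example, not SRL's actual rules; a real review would still need the manual step described above.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical keyword lists chosen for illustration only;
# they are NOT the heuristics used in the study.
SENSITIVE_PARAMS = {"user", "email", "token", "projectid", "docid"}
SENSITIVE_PATH_WORDS = {"sign", "invoice", "unsubscribe", "booking"}


def looks_sensitive(url: str) -> bool:
    """Return True if the URL carries personalized-looking path or query data."""
    parts = urlsplit(url)
    # Query parameter names, lower-cased so "projectId" matches "projectid".
    params = {name.lower() for name in parse_qs(parts.query, keep_blank_values=True)}
    # Non-empty path segments, lower-cased.
    path_words = {segment.lower() for segment in parts.path.split("/") if segment}
    return bool(params & SENSITIVE_PARAMS) or bool(path_words & SENSITIVE_PATH_WORDS)


print(looks_sensitive("https://mycust.saasexample.com/sign?projectId=4356&user=mc"))
print(looks_sensitive("https://mycust.saasexample.com/"))
```

A keyword filter like this would only narrow down candidates; it cannot tell whether the page behind the URL actually renders private data.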
As a result, SRL identified about 600 URLs from 76 domains exposing various types of sensitive information. This shows that while most domains use various techniques to protect sensitive information from being exposed (e.g. strict authorization and authentication, use of reCAPTCHA to detect automated HTTP GET requests, or obfuscating or masking PII by replacing some characters with other characters such as *), about 10% of the domains (76 of 700) expose some information.
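As a rough illustration of the masking technique mentioned above, the sketch below replaces most of an email address's local part with `*` before it is rendered on a page. The function name `mask_email` and the choice to keep one leading character are assumptions for this example, not a description of how any particular SaaS provider does it.

```python
def mask_email(address: str, visible: int = 1) -> str:
    """Mask the local part of an email address, keeping `visible` leading characters."""
    local, sep, domain = address.partition("@")
    if not sep:
        # Not an email address: mask everything rather than guess.
        return "*" * len(address)
    shown = local[:max(visible, 0)]
    return shown + "*" * (len(local) - len(shown)) + "@" + domain


print(mask_email("jane.doe@example.com"))  # j*******@example.com
```

Masking of this kind limits what a public screenshot reveals, but it does not replace authentication: the domain and the surrounding context of the page remain visible.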
What Domain Owners Can Do
Ten percent is too many. Symantec Research Labs recommends the following measures to all domain owners to help protect PII.
- Web applications should assume that any personalized link they create could be exposed, and should enforce appropriate authentication and authorization checks.
- Users often use link sharing on SaaS providers for content without realising that content at the end of a shared link can be exposed on the Web. Solutions that provide secure link following should be used, while ensuring that privacy settings for these SaaS solutions are configured company-wide to safeguard data.
- Performing manual or automated link following (e.g. for security reasons) can have some serious consequences in terms of privacy if URLs are not sanitized before being checked and/or the target applications are too permissive. Companies using DIY automated URL lookup solutions can end up leaking sensitive data onto the Web.
- As much as possible, headless URL browsing should respect robots.txt directives (either at the site or document level), or at least not share its browsing results publicly. Leaving traces of exposed data (e.g. confidential documents, PII) poses a significant privacy risk to users and organizations. It can also lead to data governance compliance issues.
- Using a product like Symantec DLP can help companies determine if links to sensitive content are being exposed outside the company.
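Two of the recommendations above, sanitizing URLs before they are checked and respecting robots.txt, can be sketched with Python's standard library. This is a minimal example under stated assumptions: the function names, the `example-scanner` user agent, and the sample robots.txt content are hypothetical, and stripping the whole query string is just one possible sanitization policy.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def sanitize_url(url: str) -> str:
    """Drop the query string and fragment, which often carry personalized tokens."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def allowed_by_robots(url: str, robots_txt: str, agent: str = "example-scanner") -> bool:
    """Check a URL against robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt content for the example domain.
robots = "User-agent: *\nDisallow: /sign\n"

url = "https://mycust.saasexample.com/sign?projectId=4356&user=mc"
print(sanitize_url(url))               # https://mycust.saasexample.com/sign
print(allowed_by_robots(url, robots))  # False
```

Removing the query string may change what the target page returns, which is the intent here: the scanner sees the generic page rather than the personalized one. Paths can also embed identifiers, so a stricter policy may be needed for some applications.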