Web scraping is the automated extraction of information from web pages using dedicated software. It makes it possible to retrieve structured or unstructured data from websites and can be useful for public data collection, research, price tracking, content aggregation, competitive analysis and more. Web scraping can also feed the data sets used to train generative artificial intelligence models that answer user questions.

Web scraping relies on software that parses the source code of pages, collecting the data of interest and sometimes the entire structure of a site (a minimal sketch follows the list below). It is a very versatile technique: in addition to search engine indexing, it is used for many other purposes:

  • Creating contact databases;
  • Monitoring and comparing the prices of online offers;
  • Combining data from different online sources;
  • Tracking online presence and reputation;
  • Collecting financial, weather and other data;
  • Monitoring web content for changes;
  • Collecting data for research purposes;
  • Performing data mining.
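To make the mechanics concrete, here is a minimal sketch of the kind of parsing such software performs, written in Python with the widely used requests and BeautifulSoup libraries. The URL and the CSS selectors are hypothetical placeholders, not taken from any real site.

```python
# Minimal web scraping sketch: download a page and extract
# product names and prices from its HTML source.
# The URL and CSS selectors below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"  # placeholder target page

# Fetch the page, identifying the client and failing fast on errors.
response = requests.get(
    URL,
    headers={"User-Agent": "demo-scraper/1.0"},
    timeout=10,
)
response.raise_for_status()

# Parse the source code and walk the elements of interest.
soup = BeautifulSoup(response.text, "html.parser")
for item in soup.select("div.product"):    # hypothetical container class
    name = item.select_one("h2.name")      # hypothetical selectors
    price = item.select_one("span.price")
    if name and price:
        print(name.get_text(strip=True), "->", price.get_text(strip=True))
```

Real scrapers typically add politeness measures on top of this, such as rate limiting and respect for robots.txt (discussed further below).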

Some web scraping examples

Examples are numerous: search engines such as Google monitor millions of web pages to collect words, phrases, images, videos and any information useful for providing users with more complete and accurate results. Price comparison tools help users find the best deals on certain products or services. Other uses include monitoring job ads from multiple websites at once and collecting e-mail addresses for marketing purposes.

Web scraping is legal insofar as it acquires freely available, public data that is not protected by copyright. The real point of contention is the use of the data, which is often sold to third parties to build tailored scams and customized spam campaigns. Web portal operators have the right to put protective measures in place, but these are often circumvented unlawfully, in violation of the terms of service.

No telephone directory may be created other than on the basis of the DBU, the single national database of telephone numbers. Prior consent is required to use telephone numbers, contact data may not be collected by web scraping, and users' right to opt out must be ensured.

The Garante's measure

The Italian Privacy Guarantor (Garante) is very attentive to the problem and recently intervened with a measure against the owner of the website trovanumeri.com, ordering a stop to the creation and online dissemination of a telephone directory built from data collected through web scraping and imposing a fine of 60,000 euros. The current regulatory framework does not allow the creation of general telephone directories that are not extracted from the single database (DBU) of telephone numbers and customer identification data held by all national fixed and mobile telephone operators.

The Authority's findings showed that the site owner had no valid legal basis for processing the data; the site gave no directions for contacting the data controller, and data deletion could not be obtained because the form provided for that purpose did not work. The brief privacy notice published on the site also failed to identify its owner, whose identification required a lengthy investigation. The Garante therefore declared the collection, storage and publication of the personal data unlawful.

In recent years, the Garante has received numerous court appeals and requests for intervention concerning the unauthorized publication of names, addresses and telephone numbers, including those of holders of unlisted numbers.

Another well-known Italian case of alleged web scraping is the 2019 lawsuit that Trenitalia filed against the British company Gobright Media Ltd, producer of Trenìt, an app that lets users compare high-speed train fares. The dispute centered on the data and the license to use it: Trenitalia accused the British company of improperly using its database by accessing information such as train traffic management, ticket prices, timetables and delays. The Rome court first ordered Gobright to cease its web scraping activity and later authorized it, finding that it did not amount to substantial misappropriation of the database.

Widening the focus, other well-known cases deserve mention, involving unlawful web scraping by companies that violate terms of service or copyright rules.

In a case decided by the U.S. Ninth Circuit Court of Appeals, LinkedIn sought to prevent a competitor, HiQ, from scraping personal information from users' public profiles on the social network. In 2019 the court ruled that the CFAA (Computer Fraud and Abuse Act) had not been violated, because the LinkedIn data being scraped was public and not password-protected.

Another case in the headlines involves Clearview AI: the facial recognition company received a hefty fine for scraping millions of photos of people's faces taken from social media. Clearview AI processed sensitive data without a valid legal basis.

In the trovanumeri.com case, the Garante thus reiterated some important principles: people who publish their contact information on the web do not necessarily intend to receive marketing communications, or to see that information indexed and further disseminated. Collecting contact data to build lists for later marketing use is unlawful, and so is disseminating such data in the form of a list.

In setting the amount of the fine, the Authority took into account the seriousness of the violation, the large number of people whose data was published (approximately 26 million users), the duration of the violation, and the willful nature of the data controller's conduct.

How to defend against web scraping

But is it possible for website operators to defend themselves against web scraping?

First of all, websites can create restricted areas that can only be entered after registration, as social networks do, granting different levels of access to certain content. Operators can also deploy anti-bot services, robots.txt files, or blocks on the IP addresses used by bots; a minimal robots.txt sketch follows below. It is also very important that a site's terms of service (TOS) expressly prohibit the use of scraping techniques for the systematic retrieval of data and information.
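As an illustration of one of these defensive measures, here is a minimal robots.txt file; the paths and the bot name are hypothetical examples, not taken from any real site.

```
# Hypothetical robots.txt, served at https://example.com/robots.txt
User-agent: *          # rules for all crawlers
Disallow: /members/    # keep bots out of the registration-only area
Disallow: /prices/     # example of data the operator wants to protect

User-agent: BadBot     # hypothetical scraper identified in server logs
Disallow: /            # ask this bot to stay away entirely
```

Note that robots.txt is purely advisory: it only stops crawlers that choose to honor it, which is why it is best combined with server-side controls such as IP blocking, rate limiting and anti-bot services.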