Web and online data: scraping, terms and compliance
The open web is a vast data source, but collecting from it raises real legal and ethical questions. This guide covers web and online data, the compliance landscape, and how to source it responsibly.
What web data offers
Web data (prices, listings, reviews, content and availability) provides timely signals about markets, competitors and behaviour. It powers price monitoring, market research and many alternative-data products.
The compliance landscape
Collecting web data is not automatically permissible. Terms of service, copyright and database rights, personal-data rules, and computer-misuse considerations all apply. Just because data is visible does not mean it is free to take and reuse.
Personal data on the web
Publicly visible personal data is still personal data under the GDPR. Sourcing it requires a lawful basis and respect for individuals' rights, and aggregation or anonymisation is often appropriate.
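As one illustration of what aggregation might look like, the sketch below reduces individual reviews to per-product counts and average ratings, dropping author identifiers and suppressing small groups. The field names (`product`, `rating`, `author`) and the `MIN_GROUP_SIZE` threshold are assumptions made for the example; whether a given output is sufficiently anonymous remains a legal assessment, not something the code decides.

```python
from collections import defaultdict

# Hypothetical threshold: groups smaller than this are suppressed,
# since very small groups can point back to individuals.
MIN_GROUP_SIZE = 5


def aggregate_reviews(reviews, min_group_size=MIN_GROUP_SIZE):
    """Reduce per-person reviews to per-product summary statistics."""
    ratings_by_product = defaultdict(list)
    for review in reviews:
        # Only the product and rating are carried forward; author names
        # and any free text are deliberately left behind.
        ratings_by_product[review["product"]].append(review["rating"])

    return {
        product: {"count": len(ratings), "mean_rating": sum(ratings) / len(ratings)}
        for product, ratings in ratings_by_product.items()
        if len(ratings) >= min_group_size
    }
```

The threshold is a design choice rather than a rule: a higher value trades coverage for a lower chance that a summary row describes an identifiable person.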
Quality and reliability
Web data is messy and changes constantly: page structures shift, content varies, and coverage is uneven. Robust collection includes validation, change detection and clear documentation of method.
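Validation and change detection can be sketched roughly as follows. The example fingerprints the sequence of HTML tags on a page, so that a layout change shows up as a different hash, and checks each extracted record against a small schema. The function names and the `name`/`price` fields are hypothetical, chosen only for illustration.

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects the sequence of opening tags, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def structure_fingerprint(html: str) -> str:
    """Hash the tag sequence so that layout changes become detectable."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return hashlib.sha256("/".join(collector.tags).encode("utf-8")).hexdigest()


REQUIRED_FIELDS = ("name", "price")  # hypothetical schema


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if record.get(name) is None
    ]
    price = record.get("price")
    if price is not None and (not isinstance(price, (int, float)) or price < 0):
        problems.append("price is not a non-negative number")
    return problems


# Same layout with different text gives the same fingerprint;
# swapping <span> for <p> gives a different one.
before = structure_fingerprint("<div><span>Widget</span><b>9.99</b></div>")
after = structure_fingerprint("<div><p>Widget</p><b>9.99</b></div>")
layout_changed = before != after
```

Storing the fingerprint alongside each collection run gives a simple signal that an extractor may need review before its output is trusted.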
Sourcing responsibly
Responsible sourcing means respecting terms and law, preferring licensed or official feeds where available, and documenting the basis for collection. Where official APIs or licensed datasets exist, they are usually the better route.
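Documenting the basis for collection could be as simple as a structured record stored with each dataset. The sketch below is one possible shape, not a standard: the `CollectionRecord` fields, the route labels and the preference order (official API, then licensed feed, then scraping) are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Assumed preference order: official or licensed routes before scraping.
ROUTE_PREFERENCE = ("official_api", "licensed_feed", "scrape")


def preferred_route(available):
    """Pick the most preferred access route among those available."""
    for route in ROUTE_PREFERENCE:
        if route in available:
            return route
    raise ValueError("no known access route available")


@dataclass
class CollectionRecord:
    """Hypothetical provenance record kept alongside a collected dataset."""

    source: str
    access_route: str
    legal_basis: str
    terms_reviewed_on: str
    contains_personal_data: bool
    collected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only; example.com stands in for a real source.
record = CollectionRecord(
    source="https://example.com/feed",
    access_route=preferred_route({"scrape", "licensed_feed"}),
    legal_basis="licence agreement",
    terms_reviewed_on="2024-01-01",
    contains_personal_data=False,
)
```

Keeping the record machine-readable means it can travel with the data, so a downstream buyer can see how and on what basis each dataset was collected.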
In a managed model
A managed partner can assess the legal basis, prefer licensed sources, and deliver web-derived data with documented provenance, reducing the buyer's risk.
Visible does not mean free to use
The central misconception about web data is that anything publicly visible can be taken and reused. In reality, terms of service, copyright and database rights, computer-misuse rules and, for personal data, the GDPR all apply. Public personal data is still personal data. Responsible sourcing assesses the legal basis for collection, prefers official APIs or licensed feeds over scraping, and documents the basis, because a defect at collection becomes the buyer’s risk downstream.
Quality and change
Web data is messy and unstable: page structures shift, coverage is uneven, and content varies. Robust collection includes validation, change detection and clear documentation of method, and treats representativeness with caution. Where licensed or official sources exist, they are usually a better route than scraping at scale.
- Web data offers timely market and behavioural signals.
- Visible does not mean free: terms, copyright and privacy all apply.
- Public personal data is still personal data under the GDPR.
- Prefer licensed or official feeds; document the collection basis.
Sources & further reading
- EUR-Lex: Directive 96/9/EC (database rights) and Directive (EU) 2019/790.
- EUR-Lex: Regulation (EU) 2016/679 (GDPR).
- European Data Protection Board: guidance on publicly available data.
- Court rulings on web data and terms of service.