
How high-quality datasets improve LLM training

In this interview, Paolo Budroni, AI Project Manager, explains why high-quality datasets are essential for reliable LLM training. He discusses the importance of collecting diverse and representative data, removing duplicates, spam and toxic content, and applying normalization, anonymization and source balancing. The interview also challenges the assumption that more data is always better, showing how smaller, carefully curated datasets can outperform larger but noisy collections. Finally, Budroni highlights privacy, fairness and regulatory compliance, including GDPR and the EU AI Act, as fundamental elements of ethical dataset curation and user trust.
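As a rough illustration of the curation steps mentioned above, the following minimal Python sketch chains normalization, a very simple form of anonymization, exact-duplicate removal and source balancing. It is not the method described in the interview: the function names, the `[EMAIL]` placeholder, the email pattern and the `max_per_source` cap are all assumptions made for this example, and spam or toxicity filtering is left out because it would normally require a dedicated classifier.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical, deliberately simple pattern; real anonymization covers far more than emails.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize(text):
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def anonymize(text):
    """Replace email addresses with a placeholder token (assumed convention)."""
    return EMAIL_RE.sub("[EMAIL]", text)


def curate(records, max_per_source):
    """Clean (source, text) pairs, drop exact duplicates, and cap each source.

    `max_per_source` is an illustrative balancing rule: at most that many
    documents are kept from any single source.
    """
    seen = set()
    per_source = defaultdict(int)
    kept = []
    for source, text in records:
        clean = anonymize(normalize(text))
        key = clean.casefold()  # case-insensitive duplicate detection
        if not clean or key in seen:
            continue  # skip empty or already-seen documents
        if per_source[source] >= max_per_source:
            continue  # source already has its share
        seen.add(key)
        per_source[source] += 1
        kept.append((source, clean))
    return kept


if __name__ == "__main__":
    sample = [
        ("web", "Hello   world"),
        ("web", "hello world"),
        ("web", "Contact me at a.b@example.com"),
        ("web", "Third doc"),
        ("forum", "Forum post"),
    ]
    for source, text in curate(sample, max_per_source=2):
        print(source, "|", text)
```

In this sketch the cap is applied in arrival order, so earlier documents from an over-represented source win; a real pipeline would more likely sample or weight sources, and would use near-duplicate detection rather than exact matching.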

