Leveraging Advances in Private Training for On-device Language Models

Language models (LMs) that predict subsequent input text have become a pivotal technology in many applications. In Gboard, they enhance the typing experience by supporting features such as next word prediction (NWP), Smart Compose, smart completion, suggestion, slide to type, and proofread. Compared with deploying models on enterprise servers, deploying them on users' devices and training them directly from on-device user data offers advantages such as lower latency and stronger privacy.

Over the years, Google's research advances, from the conceptual development of Federated Learning (FL) in 2017 to formal Differential Privacy (DP) guarantees in 2022, have powered the private training of Gboard LMs. Formally, DP is often characterized by a pair (ε, δ), where smaller values represent stronger guarantees. Today, all NWP neural network LMs in Gboard are trained with FL under formal DP guarantees, and every future launch of a Gboard LM trained on user data will require DP guarantees. These numerous on-device LMs have been launched in multiple languages and countries. As a rule of thumb, a model is considered to have a reasonable DP guarantee at ε = 10 and a strong DP guarantee at ε = 1, provided δ is small.
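For readers unfamiliar with the (ε, δ) notation, the standard textbook definition is reproduced below. The symbols M (a randomized training mechanism), D and D' (neighboring datasets), and S (a set of possible outputs) do not appear in the article and are introduced here only for illustration. What counts as "neighboring" (for example, datasets differing in one user's data) is a modeling choice that the article does not specify.

```latex
% M satisfies (\varepsilon, \delta)-DP if, for all neighboring datasets D, D'
% and every set S of possible outputs:
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
```

Smaller ε forces the two output distributions closer together, and smaller δ shrinks the probability mass allowed to violate that bound, which is why smaller values of both mean a stronger guarantee.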

The paper "Private Federated Learning in Gboard" discusses the privacy principles currently reflected in production models, including transparency and user control, data minimization, data anonymization, and auditability and verifiability.

The private training of Gboard LMs has been a journey of continual research and development. Recently introduced techniques, such as the DP-Follow-The-Regularized-Leader (DP-FTRL) algorithm, the use of public data, and tighter privacy accounting, have improved the privacy-utility-computation trade-offs. With these techniques, a strong DP guarantee of ε ≤ 1 is not just theoretically possible but practically feasible.
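To make the idea of private federated aggregation more concrete, here is a minimal sketch of the clip-and-add-noise step that underlies DP training on client updates. It is not DP-FTRL and not Gboard's production code: it uses the simpler independent Gaussian noise of DP-FedAvg-style training, whereas DP-FTRL adds noise that is correlated across training rounds. The function names, the toy updates, and the parameter values are all hypothetical.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale a client update so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= clip_norm or norm == 0.0:
        return list(update)
    scale = clip_norm / norm
    return [x * scale for x in update]


def private_mean(client_updates, clip_norm, noise_multiplier, rng):
    """Average clipped client updates after adding Gaussian noise to their sum.

    Each client contributes at most clip_norm (in L2 norm) to the sum, so
    Gaussian noise with standard deviation noise_multiplier * clip_norm
    limits how much any single client can influence the result.
    """
    dim = len(client_updates[0])
    total = [0.0] * dim
    for update in client_updates:
        clipped = clip_update(update, clip_norm)
        for i, value in enumerate(clipped):
            total[i] += value
    stddev = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, stddev) for t in total]
    return [v / len(client_updates) for v in noisy]


# Hypothetical toy updates from three clients, for illustration only.
updates = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
rng = random.Random(0)
print(private_mean(updates, clip_norm=1.0, noise_multiplier=0.5, rng=rng))
```

The sketch does no privacy accounting, so it does not by itself yield a specific (ε, δ). In practice the achieved ε depends on the noise multiplier, the number of rounds, how clients are sampled, and the accounting method used.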

Disclaimer: The above article was written with the assistance of AI. The original sources can be found on Google Blog.