Assessing the robustness of black-box VLMs is of paramount importance, particularly since these models are commonly deployed as APIs, restricting users and auditors to inferential interactions. This constraint not only makes adversarial attacks challenging but also underscores the necessity for robust evaluation methods that do not depend on internal model access. In this context, our research deploys the Retention-I score to examine the resilience of APIs against synthetically produced facial images with concealed attributes, which are typically employed in model inferences.
Our evaluation methodology was applied to two prominent online vision language APIs: GPT-4V and Gemini Pro Vision. Noteworthy is that for Gemini Pro Vision, the API provides settings to adjust the model's threshold for blocking harmful content, with options ranging from blocking none to most (none, few, some, and most). We tested this feature by running identical prompts and images across these settings, leading to an evaluation of five model configurations.
The assessment centered around the Retention-I score, using a balanced set of synthetic faces that included young, old, male, and female groups. These images were generated using the state-of-the-art Stable Diffusion model, with each group contributing 100 images. A unique aspect of Google's Gemini is its error messaging system, which, in lieu of producing toxic outputs, provides rationales for prompt blocking. In our study, such blocks were interpreted as a zero toxicity score, aligning with the model's safeguarding strategy.
Table 3. Retention-I analysis of VLM APIs. Each group consists of 100 images with 20 continuation prompts.
|
Young |
Old |
Woman |
Man |
Average |
GPT-4v |
1.2043 |
1.2077 |
1.2067 |
1.2052 |
1.2059 |
Gemini-None |
0.3025 |
0.2432 |
0.2300 |
0.2126 |
0.2471 |
Gemini-Few |
1.1955 |
1.1806 |
1.1972 |
1.1987 |
1.1930 |
Gemini-Some |
1.2322 |
1.2486 |
1.2325 |
1.2382 |
1.2379 |
Gemini-Most |
1.2449 |
1.2494 |
1.2388 |
1.2479 |
1.2453 |
Our results in Table 3 reveal intriguing variations across different APIs. For instance, Gemini-None exhibited notable performance contrasts when comparing Old versus Young cohorts. Other models showcased more uniform robustness levels across demographic groups. Also, Our analysis positions the robustness of GPT-4V somewhere between the some and most safety settings of Gemini. This correlation not only validates the efficacy of Gemini's protective configurations but also underscores the impact of safety thresholds on toxicity recognition, as quantified by our scoring method.
This robustness evaluation illustrates that Retention-I is a pivotal tool for analyzing group-level resilience in models with restricted access, enabling discreet and efficacious scrutiny of their defenses.