Image in image block

Attacking AI Video Processing

AI is processing videos to digest the content, and providing summaries of the video, threat detection, behavioral tracking, medical rehabilitation analysis, to name a few. A variety of information can be inferred from videos including object classification, audio transcription, subtitle streams, optical character recognition, and visual comparisons. The attack surface and potential vulnerabilities increase as more types of information are inferred from video content.  

We will consider the following inferred data types:

  • Transcribing audio into text
  • Segmenting into scenes or chapters
  • Identifying objects
  • Summarizing the output of the above data types

This blog will demonstrate how to elicit LLM confusion, information disclosure, resource exhaustion and service crashes by crafting videos with prompt injections, edge cases in streams and their properties, and simulating transmission errors. Tools are presented that tailor the videos to the system under test. We can loosely call this “fuzzing”. We will create videos to attack these systems and recommend mitigations.

Data Types

The current capabilities of AI video processing enable the extraction of the following data types from video:

  • Scenes identified by start and end time markers
  • Object classification in selected frames
  • Frame embedding using a vision model
  • Transcription of audio channel
  • Subtitle streams
  • Optical character recognition (OCR) of selected frames for any on-screen text
  • Frame caption using a caption model

There is a lot of information that can be extracted from a video. A typical architecture is to use a “pipeline” that breaks the process into components. Each component of the pipeline is specific to the type of data it is extracting. Toward the end of the pipeline, the data from the several components are assembled into a final output. We'll assume this approach.

Video Properties

There are numerous properties of a video that can affect the memory, disk and compute required to extract the data types. These properties also affect the desired outcome of the service by exercising edge cases or invoking error conditions that the development team did not consider.

Containers

Video containers define how the video, audio and subtitle streams are stored. The most popular container formats currently are MP4, MKV, and MOV. The software consuming the video will have a list of supported containers. Knowing this is important to properly craft malicious videos.

Frame Rate

The frame rate is the number of frames per second (FPS). Typical frame rates are 24, 30 and 60. Videos capturing action, such as sporting events, may be 120 FPS. The frame rate isn't limited to these values but there are historical reasons why they are common. The file size of the video is directly proportional to the frame rate. In our testing we can leverage an arbitrary frame rate to affect the video size and memory usage as described later on in this post.

In the description of data types above, there is a qualifier of “selected frames”. A 24 FPS video that is 30 minutes long will have 43,200 frames. That is a lot to process and mostly unnecessary because so many frames are similar to their adjacent frames, with the same scene. The processing software will select a subset of frames to use with OCR, object classification or a caption model.

Video Codec

The video codec determines how the individual frames of the video are represented. Nearly all codecs incorporate some form of compression. The general approach is to identify key-frames that are fully included in the video. Frames between key-frames only contain the visual difference from the previous frame. This approach is space-efficient because motion can be evoked by the gradual changing of many frames over time. If the difference between frames is small, then the total video size is reduced. This will be important later.

Audio

Audio streams have a sampling rate typically measured in kilohertz (kHz). The audio sample rate describes a similar measure to the frame rate but is independent of the video stream. DVD quality audio is typically sampled at 48 kHz. There is also the bit rate, usually measured as kilobits-per-second (kbps). Some typical MP3 bit rates are 128 kbps and 160 kbps. The sampling rate and bit rate directly affected the size of the audio channels. Again, we can leverage this to our advantage.

An audio stream may have multiple channels. Stereo audio has two channels. Surround sound has a variety of configurations. 5.1 is a typical configuration which specifies six channels, the ".1" identifies a subwoofer channel. The way in which the processing software handles multiple channels is of interest to us. Does it down-mix to a single channel, or transcribe all channels separately? How are multiple transcriptions represented to the AI?

Subtitles

Subtitle streams can either be text or image based. Each string of text or image has a start and end time associated with it. Video discs typically have image-based subtitles, for which OCR is useful. There are many text-based subtitle formats. Some include markup features to specify fonts, bold, italics, motion, animation and scripting. The process of adding subtitles to the video frames is called “burning in”. Image-based subtitles are merged on top of the video frames and provide little flexibility to the player. Text-based subtitles allow the player more flexibility with presentation.

Attacks

We’ve covered the video properties that are important to our attacks. Let's look at how we can leverage unusual values to attack the AI processing pipeline.

LLM Confusion

Let's start with confusing the text-based large language model (LLM). As stated, the pipeline will have multiple components to extract the data from the video. The data needs to be presented to the LLM in a textual form to perform the analysis.

We begin with an example of a prompt template for the LLM to summarize a video: 
 
Create a concise, coherent summary of the video based on the scene transcripts and visual cues below. 

Title: {{title}} 
 
{% for scene in scenes %}   
 
Scene ({{scene.start}}-{{scene.end}} s): 
TRANSCRIPT: {{scene.transcript}} 
SUBTITLE: {{scene.subtitle}} 
CAPTION: {{scene.caption}} 
OBJECTS: {{scene.object_classifications}} 
OCR: {{scene.ocr}} 
   
{% endfor %} 
 

That's a good amount of information the LLM has to process. For a normal video, such as a patient interview or brief clip at the zoo, the LLM will infer a good idea of what's going on, recurring themes, etc.

An important control we need to consider are guard rails. Guard rails are a type of output validation for LLM based systems. LLMs are probabilistic systems, so the output for a given input can change. A typical guard rail is that the AI should not tell the user how to conduct illegal activity.  

Development teams spend most of their effort on the expected input because that brings the most value to customers. A valid expectation is that a non-malicious video will have consistent content in each of the data types. 

What are the consequences of unexpected data from an LLM perspective?

  • Where are the guard rails applied?
  • Are there components that don't have guard rails because it's assumed that another component's guard rails will catch undesirable content?
  • What if the on-screen text, transcript and subtitle say completely different things?
  • If the frame content passes the guard rail, but the transcript is nefarious and doesn't match the visual content at all, will the transcript be censored?
  • Vice-versa, if the transcript or subtitle is acceptable, will undesirable visuals be accepted?

What LLM injection scenarios are present? The template above is simplistic and not intended to show a production-ready implementation.

  • Is ALL of the content guarded? For example, can OCR be used for prompt injection whereas the transcript or subtitle are not viable?
  • Is it possible to create a visual that produces an LLM instruction in the caption?

The length of text may be used to overwhelm the LLM context and inject instructions. For example, common text-based subtitle formats have no character limitation.

Resource Exhaustion

Resource exhaustion refers to overwhelming the memory, disk and/or compute of the processing pipeline to degrade service.

Video file sizes can easily grow into the hundreds of megabytes or gigabytes by normal recording devices. The first thought for protecting the pipeline from resource exhaustion is to limit the accepted file size. However, we'll see this isn't enough.

There are several areas where we can fuzz the LLM to attempt resource exhaustion. Let's examine the example summary template given above.

First, the scene count can be artificially inflated. Scene detection can be complicated. At a high-level, it looks for sufficient differences between a particular segment of video, and the preceding/following segments. Audio may also be considered in the detection by looking for periods of silence and other noticeable volume changes. If we can generate a video with a lot of scene changes, it may result in resource exhaustion. One method is to produce a slide show video, where each scene is one image repeated for many frames. The video codec will compress this considerably, allowing us to fit hundreds or thousands of scene changes in a video that fits within any file size limitation enforced by the pipeline.

Object detection can be abused by creating frames with more objects than the pipeline is designed to process. Examples of objects are vehicles, animals, and buildings. The number of objects considered to be too many may be in the tens or hundreds. The model used to detect objects is important because there may be a minimize width and height requirement per object.

As previously stated, most text-based subtitle formats do not have a character length limitation. There is an effective limit when rendered on the screen, but the LLM prompt does not have the same limitation. The pipeline may extract the subtitle as-is and add to the template. This could produce very long text.

For many components in the pipeline to do their work, uncompressed frames are needed. Whether the frames are stored on disk or in memory, we can attempt to exhaust the resource.

Increasing video dimensions typically have an exponential effect on resource utilization. For example, the H.265 (HEVC) codec supports dimensions over 8192x4320 (4K video). Storing frames of this size will take considerably more space and compute than a 1080p video.

The frame rate may also impact the resource usage. For example, if the pipeline is sampling every 10th frame, we can generate a video with 200 FPS (or more). A space-efficient video codec such as H.265 will compress this considerably to reduce the total file size. When expanded, the frames will take considerably more space. Again, compute, memory and disk may all be affected.

Unexpected Errors

Timestamps

Timestamps are critical for the proper interpretation of video data. Modifying the timestamps to be out of order, very large, or possibly negative could adversely affected processing. This requires custom tooling as video processing software intends to produce valid videos.

Random Errors

When video is transmitted over USB cables or networks, there are error correction protocols in place to ensure the data is not corrupted during transit.

This error-free assumption is valid in most settings. One case where this isn't a safe assumption is recording from broadcast television. In this medium, video is transmitted over the air from the station antenna to the receiver antenna miles away. Atmospheric conditions may introduce errors in the stream. There is no mechanism to request re-transmission, so the errors remain. We can leverage the error-free assumption by introducing artificial errors into the video. The errors may either be in key places or at random. Some containers and codecs are designed to be resilient to a small percentage of errors.

Conclusion

There is a lot of information to be gathered from a video. This is beneficial for users since video is easy to capture, and services provide ways to understand the data quickly and thoroughly. The attack surface increases with the amount of data gathered. Our testing needs to fully explore these threats to protect our customers and users.

In a following post we'll look at how to use open-source tools to generate videos with fuzzed parameters.

 

More information

Discover how cyber experts like Patrick Double, Security Engineer and author of this AI Video Processing article, can help secure your organization with AI Security Services. Fill out the form, and we’ll contact you within one business day.

USP

Why choose Bureau Veritas Cybersecurity

Bureau Veritas Cybersecurity is your expert partner in cybersecurity. We help organizations identify risks, strengthen defenses and comply with cybersecurity standards and regulations. Our services cover people, processes and technology, ranging from awareness training and social engineering to security advice, compliance and penetration testing.

We operate across IT, OT and IoT environments, supporting both digital systems and connected products. With over 300 cybersecurity professionals worldwide, we combine deep technical expertise with a global presence. Bureau Veritas Cybersecurity is part of the Bureau Veritas Group, a global leader in testing, inspection and certification.