With television continuously cutting into other forms of advertising exposure, the cost of being on TV keeps rising. Just as the print media earns revenue on “space”, the TV media sells “time” on the broadcast stream. Advertisements are booked by the length of their time slots, so the advertiser is keenly interested in knowing whether the promised time slot was indeed allocated.
There are thousands of television channels around the world, with more than a thousand advertising agencies per channel and more than a thousand clients per agency. This gives a sense of the number of people involved in, and interested in, validating the content of a broadcast stream. Often the need is to ensure “What You Get Is What You Pay”.
Sometimes the advertiser also wants to align the broadcast across all television channels so that it appears simultaneous, which is difficult to validate.
A Television Broadcast Monitoring Agency (TVBMA) can set up systems that not only identify a broadcast within seconds but whose currency is the number of frames transmitted, giving an accurate measurement.
This paper describes a method for bringing two videos (recorded at different times) into spatiotemporal alignment, then comparing and combining the corresponding pixels for applications such as background subtraction, compositing, and increasing dynamic range. We align a pair of videos by searching for frames that best match according to a robust image registration process. This process uses locally weighted regression to interpolate and extrapolate high likelihood image correspondences, allowing new correspondences to be discovered and refined. Image regions that cannot be matched are detected and ignored, providing robustness to changes in scene content and lighting, which allows a variety of new applications.
From given multiple still images of a scene from the same camera center, one can perform a variety of image analysis and synthesis tasks, such as foreground / background segmentation and constructing high dynamic range composites. Our goal is to extend these techniques to video footage acquired with a moving camera. Given two video sequences (recorded at separate times), we seek to spatially and temporally align the frames such that subsequent image processing can be performed on the aligned images. We assume that the input videos follow nearly identical trajectories through space, but we allow them to have different timing. The output of our algorithm is a new sequence in which each frame consists of a pair of registered images.
The primary difficulty in this task is matching images that have substantially different appearances. Video sequences of the same scene may differ from one another due to moving people, changes in lighting, and/or different exposure settings. In order to obtain good alignment, our algorithm must make use of as much image information as possible, without being misled by image regions that match poorly.
Traditional methods for aligning images include feature matching and optical flow. Feature matching algorithms find a pairing of feature points from one image to another, but they do not give a dense pixel correspondence. Optical flow produces a dense pixel correspondence, but is not robust to objects present in one image but not the other. Our method combines elements of feature matching and optical flow. In a given image, the algorithm identifies a set of textured image patches to be matched with patches in the other image. Once a set of initial matches has been found, we use these matches as motion evidence for a regression model that estimates dense pixel correspondences across the entire image. These estimates allow further matches to be discovered and refined using local optical flow. Throughout the process, we estimate and utilize probabilistic weights for each correspondence, allowing the algorithm to detect and discard mismatches.
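The regression step described above can be sketched as follows. This is a minimal, zeroth-order (Gaussian-weighted mean) variant of locally weighted regression over sparse matches; the function name, the bandwidth value, and the fallback for empty neighborhoods are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def estimate_flow(query, points, offsets, weights, bandwidth=40.0):
    """Estimate a dense correspondence vector at `query` by locally
    weighted regression over sparse matches.

    points  : (N, 2) match locations in the primary image
    offsets : (N, 2) offset vectors toward the secondary image
    weights : (N,)   per-match confidence weights
    """
    # Gaussian kernel on squared distance from the query pixel
    d2 = np.sum((points - query) ** 2, axis=1)
    w = weights * np.exp(-d2 / (2.0 * bandwidth ** 2))
    w_sum = w.sum()
    if w_sum < 1e-8:
        return np.zeros(2)  # no nearby evidence: predict zero motion
    # Weighted mean of neighboring offsets (zeroth-order regression)
    return (w[:, None] * offsets).sum(axis=0) / w_sum
```

Evaluating this estimator at every pixel yields the dense field that seeds further local optical-flow refinement; a first-order (locally linear) fit would extrapolate motion gradients as well.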
Our primary contribution is a method for spatially and temporally aligning videos using image comparisons. Our image comparison method is also novel, insofar as it is explicitly designed to handle large-scale differences between the images. The algorithm cannot align images from substantially different viewpoints, partially because it does not model occlusion boundaries. Nonetheless, we demonstrate a variety of applications for which our method is useful.
The proposed setup will digitize up to 100 television channels simultaneously. The digitized stream is broken down to the individual-frame level. The software then matches these frames against the library of previously received (historical) transmissions.
The challenges are robustness to channel and encoding noise, to the encoding rate, and to re-editing of the advertisements. Using a color-based visual feature to describe the frames of the video, a sub-shot segmentation is produced that is consistent across encoding rates and extensible to streaming media. This segmentation is then used in a similarity-matrix-based matching algorithm that effectively matches temporally re-encoded and re-edited videos. Experimental matching results are given for the case of discrete video files. The method is designed to be fully extensible to any domain with a continuous media stream.
The technical demands on such an automatic advertisement monitoring system are considerable. The detection of advertisements must first of all be robust to the specific characteristics of the broadcast channel being monitored. Additionally, it must be robust to the various analog television standards (e.g. SECAM, PAL, NTSC). These requirements can be considered common to any video copy detection system, since a video can undergo an arbitrary number of signal re-encodings but must nonetheless be expected to match its original.
For the purposes of advertisement identification and market research in particular the system also needs to be able to match temporally modified (i.e. re-edited) versions of advertisements that may contain inserted or deleted video material. This means that if an advertisement has had a section inserted or removed for the purposes of modifying its length, it should still match back to its original version.
The notion of temporal tolerance that we consider is twofold: resilience in matching against versions encoded at different frame rates, such as PAL (25 fps) and NTSC (30 fps), and resilience against re-edited versions.
The problem of matching two visual feature sequences stemming from videos encoded at different frame rates is obvious. In a purely sequence-based matching approach such as convolution, features quickly become misaligned. As the compared features are no longer paired correctly, a low matching score is reported for two videos that otherwise represent exactly the same visual content.
[Illustration: frame indices 1-30 of a 30 FPS version aligned against frame indices 1-25 of a 25 FPS version spanning the same one second of video.]
The frames that optimally match are every 6th frame of the 30 fps version against every 5th frame of the 25 fps version; thus for every 5 frames an “off by one” error accumulates in the comparison of the sequences. Matching re-edited versions of advertisements back to their originals is the general problem of matching co-derivative videos. We consider co-derivative videos to be distinct versions of a single video production (e.g. the television, theatrical, and trailer versions of a film would be considered co-derivative). In the television domain, advertisements are typically lengthened or shortened by inserting or deleting part of the video material.
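The off-by-one accumulation described above can be made concrete with a small sketch; the helper name and the nearest-frame rounding are illustrative assumptions:

```python
def aligned_index(i_25, src_fps=25, dst_fps=30):
    """Map frame i of a 25 fps version to the nearest frame of a
    30 fps version showing the same instant of video."""
    t = i_25 / src_fps          # timestamp of the 25 fps frame in seconds
    return round(t * dst_fps)   # nearest 30 fps frame index

# Naive index pairing (frame i against frame i) drifts by one frame
# every 5 frames, exactly the error a convolution-style match suffers:
drift = [aligned_index(i) - i for i in range(0, 26, 5)]
```

One second of video therefore accumulates five frames of misalignment, which is why the proposal segments at the sub-shot level instead of matching frame indices directly.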
In this proposal we describe some of the fundamental technologies of such an advertisement monitoring system, based on content-based visual analysis. Using visual features extracted from each frame of the video, a feature sequence is produced and segmented at the sub-shot level.
The sub-shot segments are then compared using the key-frames of each segment. Videos are matched by scoring a similarity matrix of matched segments. This allows for the matching of videos even with inserted or deleted frames.
The Handle - Color
Because color is the aspect of the signal most often affected by noise, an edge representation of the frame has been found to offer good robustness against channel noise.
A color-coherence feature is computed for every frame in the video sequence.
Subsequently, the problem reduces to matching two feature strings, for which we apply an approximate sub-string matching algorithm. Because approximate sub-string matching is used rather than approximate whole-string matching, the method is robust to some temporal variation in the clips, such as dropped leading or trailing frames.
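A minimal sketch of approximate sub-string matching via edit distance, where the “characters” stand in for quantized per-frame features; the zero-cost first row and the minimum taken over the last row are what make the start and end positions in the stream free:

```python
def approx_substring_distance(pattern, text):
    """Minimum edit distance between `pattern` and any substring of
    `text`. Unlike full approximate string matching, the match may
    start and end anywhere in `text`, so leading or trailing frames
    in the broadcast stream cost nothing."""
    m, n = len(pattern), len(text)
    prev = [0] * (n + 1)                    # row 0 is free: match may start anywhere
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # substitute / exact match
                          prev[j] + 1,         # delete from pattern
                          curr[j - 1] + 1)     # insert into pattern
        prev = curr
    return min(prev)                        # match may end anywhere
```

A clip whose feature string appears verbatim inside the stream scores 0; small feature perturbations from noise raise the score only slightly.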
The ability of any system to exactly identify a video sequence hinges directly on the type of feature chosen to represent the visual content at the frame level. The choice of feature is therefore crucial to developing a successful matching algorithm, since its performance and robustness depend largely on the properties of the selected feature.
Reducing color information
An important source of information in visual content for recognition and discrimination is that of color. However, the amount of color information in video material is vast. In order to make the amount of information manageable, the video’s raw digital data has to be transformed into compact feature representations that convey only the most salient color aspects of the visual content. The similarity between visual content is then determined by some well-defined similarity measure in feature space.
Color histograms are the most widely used color feature representations. In essence, any color histogram algorithm follows a similar progression:
- Selection of a suitable color space
- Selection of an appropriate quantization strategy of the color space for the computation of the histograms
- Derivation of a meaningful histogram distance function
Color histograms are independent of the size and resolution of the video’s frames because the information is processed without regard to spatial context. Furthermore, they remain partially reliable for recognition even under small variations in a frame’s visual appearance (i.e. noise, or shifts in hue, luminance and/or saturation). As a starting point, a color histogram feature was selected for its properties with respect to the robustness requirements. The HSL color system was chosen as the underlying color space because it describes color directly in terms of hue, saturation and luminance. The HSL space is divided into 29 selected color bins, which provide an adequate representation of the color space. In this way, a small variation in color caused by broadcast or encoding noise affects the histogram only slightly.
Because the video database may be large, a limited amount of processing time can be afforded for a single comparison. The selected comparison method is known as histogram intersection and can be computed very efficiently.
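The two steps can be sketched as follows. The exact 29-bin HSL quantization is not specified here, so this illustration simply buckets the hue channel into 29 bins; histogram intersection is then a single pass of minimums over the normalized histograms:

```python
import colorsys

def frame_histogram(pixels, n_bins=29):
    """Quantize RGB pixels into a coarse hue-based histogram.

    Bucketing only the hue channel into 29 bins is an illustrative
    stand-in for the document's (unspecified) 29-bin HSL quantization.
    """
    hist = [0] * n_bins
    for r, g, b in pixels:
        h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
        hist[min(int(h * n_bins), n_bins - 1)] += 1
    total = sum(hist) or 1
    return [c / total for c in hist]       # normalize to sum to 1

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: 1 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

Intersection needs only one comparison and one addition per bin, which is why it suits a large database where each comparison must be cheap.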
To segment videos, we first extract the visual features frame by frame to produce a sequence of features for the video. This feature sequence is then segmented at the sub-shot level such that roughly the same segmentation is produced regardless of encoding rate. This is accomplished using a bottom-up segmentation algorithm. In the first iteration, each frame is considered its own segment. In subsequent iterations, temporally adjacent segments are combined if a merging criterion is fulfilled. The algorithm continues until no more segments fulfill the criterion.
Adjacent segments are merged if their distance in feature space is below a set threshold. An appropriate value for this threshold naturally depends on the underlying feature being considered. Since the segmentation depends on the visual feature being used, the segmentation that we generate based on the color histogram represents just one possible segmentation of the video. Other segmentations based on motion, edges, or regional visual characteristics will generate correspondingly different segmentations.
The approach is as follows:
- Capture the raw Video Broadcast Stream
- Digitalize it based on parameters without compression
- Identify compression frames (Frames that you want to ignore)
- Capture the VDO and Audio spectrum
- Store the signature
- Signature frame comparison
- Video Image Processor (VIP)
Capture the VDO stream
All the television channels will have to be captured through satellite dishes oriented in the right direction at the correct angle facing the transmitting satellite. The received signals from these antennas are modulated to integrate them onto a single stream.
The stream is the split into individual broadcast and fed into high quality digitizer. The digitizer is typically a high MHz clocked PC with 512MB or more RAM, loaded with applications to store the data on a tera-byte server.
The DSR feeds into to a PC that will have a tuner card and a high to medium resolution encoding card. The encoding card will encode in Real, Windows, QuickTime or MPEG formats, which will then be stored deploying high compression ratio.
Identify Compression Frames
The sampling theorem is used to sample at twice the rate of information transmission. A signature frame is extracted for every time period measuring little less than half the minimum time period. The viewer views these frames to differentiate between different television broadcast (a commercial content or non commercial) programs. On roll over on the signature frames, further granularity of broadcast will be visible.
There may be times where different versions of the same advertisement are run frequently. Each of these versions may occupy the same or different number of frames. The signature frame will clearly throw up these variations. The format of the report will be:
Video Image Processor (VIP)
The VIP computer delivers high-performance image processing to meet the demands of large-scale frame matching. The VIP is well suited for matching spectral colors, audio and other parameters to check symmetry and parametric alignment between frames.
The VIP computer has a image processing co-processor used for real-time. The VIP takes images produced from the encrypted or data-compressed byte stream to perform the matching.
The VIP can be networked to provide a cost effective means of sizing the image-processing throughput to meet both current and future system requirements. It offers 100/10BaseT Ethernet interface as well as fiber connectivity for high speed, and its software can be configured for specific image-processing tasks. The VIC automatically reports the health and status of the system via the network interface.
Our image alignment algorithm finds correspondences between pixels in a pair of images. Each correspondence is assigned a weight according to the likelihood that it describes a physical 3D point undergoing a physical 3D motion. The ability to characterize the correctness of a correspondence is essential to the robustness of the algorithm. We want to use as much information from the images as possible, but we do not want to be misled by unexpected differences between the images.
The weight assigned to the correspondence is the product of two terms: a pixel matching probability and a motion consistency probability. For simplicity, we assume independence when combining the probabilities.
To compute the pixel matching probability, for a particular correspondence, we evaluate how well the images match in a square region around the correspondence. Rather than simply comparing pixel values, we use a method that allows small spatial variations in the corresponding pixel locations.
A single pixel in the primary image is compared with a 3-by-3 neighborhood of pixels in the secondary image, rather than with a single secondary pixel. To do this efficiently, the algorithm applies 3-by-3 minimum and maximum filters to the secondary image, producing new images. These minimum and maximum images define bounds on the value of each pixel in the secondary image; the corresponding primary pixel receives a penalty if and only if its value lies outside this interval.
To evaluate a correspondence, our algorithm sums this pixel matching score across a square region.
Motion Regression and Consistency
To evaluate motion consistency, we determine how well the offset vector of a particular correspondence agrees with its neighbors.
Frame Matching Measure
To evaluate the quality of a match between a pair of frames, we use the robust image alignment method to find a correspondence between the frames, and then use it to estimate how well the primary and secondary frames match. Our frame matching objective function has two parts: a parallax measure and the correspondence vector magnitude. We minimize parallax because depth discontinuities will cause errors in the reconstructed correspondence field. Correspondence magnitude is less important, but we nonetheless minimize it to obtain maximal overlap between the frames. Given a pair of correspondences, we compute the distance between the points in the primary image and the distance between the points in the secondary image.
Adaptive Search for Matching Frames
Using the objective function we wish to search the secondary video for a good match to a particular frame in the primary video.
Given some initial guess of where to look in the secondary video, our algorithm evaluates several nearby frames and fits a quadratic regression model to the objective function values of these pairings. These preliminary evaluations occur at the initial guess, 1 frame forward, 1 frame backward, 5 frames forward, and 5 frames backward. Once all secondary frames near the quadratic minimum have been checked, the algorithm picks the one with the lowest objective function value. In order to compute an initial guess for the next frame search, the algorithm computes a weighted average of the changes between the frame indices of the prior matches. For the first frame of the primary video, we have no previous evidence for where to look in the secondary video.
We do not need to know the particular frame that will match best, but we need a good enough guess to initiate the quadratic search. This initial guess can be provided by the user or found automatically by a linear search of the secondary video. This search method allows substantial flexibility in the temporal mapping from one video to the other. One video can be much faster than the other or proceed in the opposite direction. The videos can change speed and relative direction, so long as the changes are smooth.
The advanced algorithms used for Meta Tagging helps in reducing the manual effort. Although concepts like voice recognition, image recognition or key differentiator between image scenes are known to expedite Meta Tagging. These technologies are yet on the test bench and will be a while before we can realize it. As of now, manual Meta Tagging is one of the safer solutions and we can only reduce the time and effort by employing advanced algorithms.
The user will start Meta Tagging by clicking on a frame, which could be beginning of an advertisement and end of the advertisement. The time stamps at both the places are captured and a row inserted into the database.