Multimedia Systems (2003) Digital Object Identifier (DOI) 10.1007/s00530-003-0076-5

Multimedia Systems
© Springer-Verlag 2003

Hierarchical video content description and summarization using unified semantic and visual similarity
Xingquan Zhu (1), Jianping Fan (2), Ahmed K. Elmagarmid (3), Xindong Wu (1)

(1) Department of Computer Science, University of Vermont, Burlington, VT 05401, USA
(2) Department of Computer Science, University of North Carolina, Charlotte, NC 28223, USA
(3) Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA

Abstract. Video is increasingly the medium of choice for a variety of communication channels, primarily as a result of the growth of networked multimedia systems. One way to keep our heads above the video sea is to provide summaries in a more tractable format. Many existing approaches are limited to exploring important units related to low-level features for summarization. Unfortunately, the semantics, content and structure of the video do not correspond to low-level features directly, even with closed-captions, scene detection, and audio signal processing. The drawbacks of existing methods are the following: (1) instead of unfolding semantics and structures within the video, low-level units usually address only the details, and (2) any important unit selection strategy based on low-level features cannot be applied to general videos. Providing users with an overview of the video content at various levels of summarization is essential for more efficient database retrieval and browsing. In this paper, we present a hierarchical video content description and summarization strategy supported by a novel joint semantic and visual similarity strategy. To describe the video content efficiently and accurately, a video content description ontology is adopted. Various video processing techniques are then utilized to construct a semi-automatic video annotation framework. By integrating acquired content description data, a hierarchical video content structure is constructed with group merging and clustering. Finally, a four layer video summary with different granularities is assembled to assist users in unfolding the video content in a progressive way. Experiments on real-world videos have validated the effectiveness of the proposed approach.

Key words: Hierarchical video summarization – Content description – Semi-automatic video annotation – Video grouping

Correspondence to: X. Zhu (e-mail: xqzhu@cs.uvm.edu)

1. Introduction
Recent years have seen a rapid increase in the use of multimedia information. Of all media types, video is the most challenging, as it combines all other media information into a single data stream. Owing to the decreased cost of storage devices, higher transmission rates, and improved compression techniques, digital videos are becoming available at an ever-increasing rate. However, presenting the video content for access such as browsing and retrieval has become a challenging task, both for application systems and for viewers. Some approaches have been described elsewhere [1–3], which present the visual content of the video in different ways, such as hierarchical browsing, storyboard posting, etc. The viewer can quickly browse through a video sequence, navigate from one segment to another to rapidly get an overview of the video content, and zoom to different levels of detail to locate segments of interest.

Research in the literature [3] has shown that, on average, there are about 200 shots for a 30-minute video clip across different program types, such as news and drama. Assuming that a key-frame is selected to represent each shot, 200 frames will impose a significant burden in terms of bandwidth and time. Using spatially reduced images, commonly known as thumbnail images, can reduce the size further, but may still be expensive if all shots must be shown for a quick browse of the content. Hence, a video summarization strategy is necessary to present viewers with a compact digest that shows only parts of video shots.

Generally, a video summary is defined as a sequence of still or moving pictures (with or without audio) presenting the content of a video in such a way that the respective target group is rapidly provided with concise information about the content, while the essential message of the original is preserved [4]. Three kinds of video summary styles are commonly used:
• A pictorial summary [3, 5–11] is a collection of still images (icon images, even varying in size) arranged in time order to convey the highlights of the video content.
• A video skimming [4, 12–15] is a collection of moving frames (video shots) arranged in time series to convey the main topics in the video, i.e. it is a trimmed video.
• A data distribution map [13, 15] is a picture that illustrates the distribution of some specific data in the database.

A video summary is clearly appealing for video browsing. By supplying a compact digest, it lets the user browse the video content quickly and comprehensively. Moreover, the power of visual summary can be helpful in many applications,


X. Zhu et al.: Hierarchical video content description and summarization using unified semantic and visual similarity

such as multimedia archives, video retrieval, home entertainment, digital magazines, etc. More and more video material is being digitized and archived worldwide. Wherever digital video material is stored, a duplicated summary could be stored at any node on the Internet, and the user’s query could be processed only at these nodes to work out a rough query result. In this way, the system could release a substantial amount of CPU time and bandwidth to process more queries. Moreover, since the abstract is always far shorter than the original data, each user’s query time is reduced significantly. Furthermore, if the library grew to thousands of hours, queries could return hundreds of segments. Generating a summary of the query results would allow users to browse the entire result space without having to resort to the time-consuming and frustrating traversal of a large list of segments.

Nevertheless, without a comprehensive understanding of the video content, the generated video summary would be unsatisfactory for most users. Consequently, various video indexing strategies have been proposed to describe the video content manually, semi-automatically or fully automatically. Based on the types of knowledge utilized, video indexing can be divided into the following three categories:
• High-level indexing: this approach uses a set of predefined index terms to annotate videos. The index terms are organized based on high-level ontological categories like action, time, space, etc.
• Low-level indexing: these techniques provide access to videos based on low-level features such as color, texture, audio, and closed-captions. The driving force behind these techniques is to extract data features from the video data, organize the features based on some distance measures, and use similarity-based matching to retrieve the video.
• Domain-specific indexing: these techniques use the high-level structure of video to constrain low-level video feature extraction and processing. However, they are effective only in their intended domain of application.

Based on the content description acquired from video indexing, various kinds of applications can be implemented [16–18].

The rest of the paper is organized as follows. In the next section, related work on video annotation and summarization is reviewed. The overall system architecture of our proposed method is described in Sect. 3. In Sect. 4, a video content description ontology is proposed. Based on this ontology, the video content annotation scheme is presented in Sect. 5. Our hierarchical video summarization scheme is introduced in Sect. 6. In Sect. 7, the effectiveness of the proposed approach is validated by experiments over real-world movie video clips, and some potential application domains of the proposed strategies are outlined. Concluding remarks are given in Sect. 8.

2. Related work
Video annotation and indexing issues have been addressed with various approaches. Because textual terms are inadequate for describing video content, text-based video annotation leads to considerable loss of information. Accordingly, many low-level indexing strategies have emerged [13, 15, 19–22] which use closed-captions, audio information,

speech recognition, etc. to explore video content. Some video classification methods have also been developed to detect the event or topic information within the video [23–26], but they are only effective in their own specific domain; moreover, only a fraction of events can be detected. We are currently able to automatically analyze shot breaks, pauses in audio, and camera pans and zooms, yet this information alone does not enable the creation of a sufficiently detailed representation of the video content to support content-based retrieval and other tasks. As the experiments reported for these methods have shown, there is still a long way to go before they can deliver satisfactory results. Hence, manual annotation is still widely used to describe video content.

The simplest way to model video content is to use free text to manually annotate each detected shot separately. However, since a segmented part of the video is separated from its context, the video scenario information is lost. To address this problem, Aguierre Smith et al. [27] implement a video annotation system using the concept of stratification to assign descriptions to video footage, where each stratum refers to a sequence of video frames. The strata may overlap or totally encompass each other. Based on this annotation scheme, the video algebra [28] was developed to provide operations for the composition, search, navigation and playback of digital video presentations. A similar strategy for evolving documentary presentation can be found in [29]. Instead of using pure textual terms for annotation, Davis et al. [30] present an iconic visual language-based video annotation system, Media Streams, which enables users to create multi-layered, iconic annotations of video content; however, this user-friendly visual approach to annotation is limited by a fixed vocabulary. An overview of research in this area can be found elsewhere [31].
Each of the manual annotation strategies identified above may be efficient in addressing specific issues, but problems still remain:
1. Even though the hierarchical strategy (stratification) has been accepted as an efficient way for video content description, no efficient tool has been developed to integrate video processing techniques (video shot and group detection, joint semantic and visual features in similarity evaluation, etc.) for semi-automatic annotation, which would free annotators from sequentially browsing and annotating video frame by frame.
2. The keywords at different content levels have different importance and granularity in describing video content; hence, they should be organized and addressed differently.
3. To minimize the subjectivity of the annotator and the influence of widespread synonymy and polysemy in unstructured keywords, one efficient way is to utilize content description ontologies. The existing methods either fail to define their ontology explicitly (using free text annotation [33]) or do not separate the ontology from the annotation data to enhance the reusability of the annotation data.

To address these problems, a semi-automatic video annotation scheme is proposed. We first define the content description ontology, as shown in Fig. 2. Then various video processing techniques are introduced to construct a semi-automatic video annotation strategy.

Even without a comprehensive understanding of video content, many low-level tools have been developed to generate video summaries by icon frames [3, 6–9, 11] or video objects [5, 10]. A curve simplification strategy is introduced in [6] for video summarization, which maps each frame into a vector of high dimensional features, and segments the feature curve into units. A video summary is extracted according to the relationship among them. In Video Manga [7], a pictorial video summary is presented with key frames in various sizes, where the importance of the key frame determines its size. To find the highlight units for video abstracting, Nam et al. [11] present a method that applies different sampling rates on videos to generate the summary. The sample rate is controlled by the motion information in the shot. Instead of summarizing general videos, some abstracting strategies have been developed to deal with videos in a specific domain, such as home videos [12], stereoscopic videos [5], and online presentation videos [8]. The general rules and knowledge followed by the video are used to analyze the semantics of the video.

In contrast to abstracting the video with pictorial images, some approaches summarize a video by skimming [13, 15], which trims the original video into a short highlight stream. Video skimming may be useful for some purposes, since a tailored video stream is appealing for users. However, the amount of time required for viewing a skim suggests that skimmed video is not appropriate for a quick overview, especially for network-based applications where bandwidth is the main concern. Neither pictorial abstracts nor skimming is inherently the more valuable, and both are supported by the strategy presented in this paper.

Ideally, a video summary should briefly and concisely present the content of the input video source. It should be shorter than the original, focus on the content, and give the viewer an appropriate overview of the whole.
However, the problem is that what’s appropriate varies from viewer to viewer, depending on the viewer’s familiarity with the source and genre, and on the viewer’s particular goal in watching the summary. A hierarchical video summary strategy [9, 34] has accordingly been proposed, which supplies various levels of summarization to assist the viewer in determining what is appropriate. In [9], a key frame based hierarchical summarization strategy is presented, where key frames are organized in a hierarchical manner from coarse to fine temporal resolution using a pairwise clustering method. Instead of letting a user accept the generated video summary passively, movieDNA [34] supplies the user with a hierarchical, visualized, and interactive video feature map (summary) called DNA. By rolling the mouse over the DNA, users can brush through the video, pulling up detailed meta-information on each segment.

In general, the aforementioned methods all work with nearly the same strategy: grouping videos, selecting important units based on low-level features, acquiring the user’s specification of the summary length, and assembling the summary. Unfortunately, there are three problems with this strategy. First, it relies on low-level features to evaluate the importance of each unit. However, the selected highlights may not cover the important semantics within the video, since there is no general linkage between low-level features and semantics. Second, important unit selection is a semantic-related topic. Different users have different value judgments, and it would be relatively difficult to determine how much more important one unit is than the others. Third, the summary length specified by the user is not always reasonable for unfolding video content, especially if the user is unfamiliar with videos in the database. As a result, those strategies just present a “sample” of the video data. Hence, we need an efficient technique to describe, manage and present the video content, without merely sampling the video.

Video is a structured medium. While browsing video, it is not the sequential frames but the hierarchical structure (video, scenes, group, etc.) behind the frames that conveys scenario and content information to us. Hence, the video summary should also fit this hierarchical structure by presenting an overview of the video at various levels. Based on this idea and the acquired video content description data, this paper presents a hierarchical video summarization scheme that constructs a four layer summary to express video content.

3. System architecture
Figure 1 presents the system architecture of the strategy proposed in this paper. It consists of two relatively independent parts: hierarchical video content description and hierarchical video summarization. First, to acquire video content, the content description ontology is proposed (as described in Sect. 4). Then, all shots in the video are parsed into groups automatically. A semi-automatic video annotation scheme is proposed, which provides a friendly user interface to assist the system annotator in navigating and acquiring video context information for annotation. A video scene detection strategy using joint visual features and semantics is also proposed to help annotators visualize and refine annotation results. After the video content description stream has been acquired, it is combined with the visual features to generate a hierarchical video summary. By integrating semantics and low-level features, the video content is organized into a four level hierarchy (video, scene, group, shot) with group merging and clustering algorithms. To present and visualize the video content for summarization, various strategies have also been proposed to select the representative group, representative shots, and representative frames for each unit. Finally, a four layer video summary is constructed to express the video digest in various layers, from top to bottom in increasing granularity.

4. Video content description architecture
To enable search and retrieval of video in large archives, we need a good description of video content. Different users may perceive the same image or video differently, and the widespread synonymy and polysemy in natural language may cause annotators to use different keywords to describe the same object. For these reasons, ontology-based knowledge management is utilized for annotation: we first define the video content description ontology, as shown in Fig. 2. Then a shot based data structure is proposed to describe video content and separate the ontology from the annotation data. One of the main originalities of our approach is that it allows dynamic and flexible video content description with various granularities, where the annotations are independent of the video data.



[Figure 1: block diagram with two parts, "Hierarchical video content description" and "Hierarchical video summarization". Block labels: digital video database; video shots; shot grouping; video groups; video content annotation ontology; semi-automatic video annotation; annotated video groups; group merging and clustering (integrating semantic and visual features); hierarchical group structure; group refinement (integrating semantic and visual features); event category selection to construct the highest summary; hierarchical video summary structure.]

Fig. 1. System architecture

[Figure 2: ontology tree with four descriptions (video, group, shot, frame). Each box is a descriptor, and the entries listed under it are instances of that descriptor. Descriptors and example instances recoverable from the figure: Video category (News, Medical video, Sports video, ...); Speciality category, shown for medical video (Ophthalmology, Immunology, ...); Actor (Doctor, Nurse, Patient, ...); Event (Presentation, Surgery, Diagnose, ...); Object (Doctor, Expert, Hand, ...); Action (Walk, Give, Move, ...); Location (Room, Gallery, Desk, ...); frame-level Object (Computer, Heart, Eyeball, ...); Status (Beating, Rotating, Hanging, ...).]

Fig. 2. Video content description ontology



4.1. Ontology-based knowledge management
In recent years the development of ontologies has been moving from the realm of Artificial Intelligence laboratories to the desktops of domain experts. Successful applications of ontologies on the web [47] range from large taxonomies categorizing websites (e.g. Yahoo (www.yahoo.com)) to categorizations of products for sale and their features (e.g. Amazon.com (www.amazon.com)). However, ontology-based video content annotation has rarely been addressed.

Generally, an ontology is defined as an explicit and formal specification of a shared conceptualization of knowledge; it provides a suitable format and a shared terminology for describing the content of knowledge sources. Typical ontologies consist of definitions of concepts relevant for the domain, their relations, and axioms about these concepts and relationships. By using a given domain ontology, one can annotate the content of a knowledge source in such a way that a knowledge-seeker can find the knowledge source easily, independently of its representation format. Hence, the role of ontologies is twofold: (1) they support human understanding and organizational communication; and (2) they are machine-processable, and thus facilitate content-based access, communication and integration across different information systems.

An ontology can be constructed in two ways: domain dependent and generic. Generic ontologies (e.g. WordNet [36]) are constructed to provide a general framework for all (or most) categories encountered in human existence; they are usually very large but not very detailed. For our purpose of describing video content, we are interested in creating domain dependent ontologies, which are generally much smaller. During the construction of ontologies, the features below are taken into consideration to guarantee the quality of the ontology.
• Open and dynamic: ontologies should have fluid boundaries and be readily capable of growth and modification.
• Scalable and inter-operable: an ontology should be easily scaled to a wider domain and adapt itself to new requirements.
• Easily maintained: it should be easy to keep ontologies up-to-date. Ontologies should have a simple, clear structure, as well as be modular. They should also be easy for humans to inspect.

Accordingly, domain experts or system managers, i.e. people who have mastery over the specific content of a domain [37], should be involved in creating and maintaining the ontologies by following the general steps below:
1. Determine the domain and scope of the ontology.
2. Consider reusing existing ontologies.
3. Enumerate important terms in the ontology.
4. Explicitly specify the definition of each class and indicate the class hierarchy.
5. Create the instances of each class.

Existing schemes and research have addressed the problems of creating and maintaining ontologies; the details can be found in [48]. After the ontology has been created, it can be utilized for knowledge management and description.

4.2. Video content description ontology
For most applications, the entire video document is at too coarse a level of content description [38, 39]. A single frame, on the other hand, is rarely the unit of interest. This is because a single frame spans a very short interval of time and there are too many individual frames, even in a short video document. Most videos from daily life can be represented by a hierarchy consisting of five layers (video, scene, group, shots, frames), from top to bottom in increasing granularity for content expression [40]. Consequently, a robust and flexible video content description strategy should also be able to describe video content at different layers and with different granularity.

To construct our video content description ontology, we first clarify that our domain is general video data from our daily life. Then, the general structure information among the videos is considered to construct the ontology for video data. When looking at a video, what kinds of things do we want to state about it? From most annotators’ viewpoint, the annotation should answer five “W” related questions: who? what? when? where? and why? Obviously, at different video content levels, the annotations should have different granularities in addressing these five questions. Hence, a video content description ontology is proposed in Fig. 2, where four content descriptions, video description (VD), group description (GD), shot description (SD) and frame description (FD), are adopted. We eliminate the scene level content description from the ontology, since a video scene depicts and conveys a high-level concept or story and cannot be detected satisfactorily without semantics. Each description is defined as follows:
1. The video description addresses the category and specialty taxonomy information of the entire video. The description at this level should answer questions like, “What does the video talk about?”
2. The group description describes the event information in a group of adjacent shots that convey the same semantic information. The description at this level should answer queries like, “give me all surgery units among the medical videos”.
3. The shot description describes the action in a single shot. This action could be a part of an event. For example, a video shot could show the action of a doctor shaking hands with a patient in a diagnosis event. Hence, the shot description should answer queries like, “give me all units where a doctor touches the head of the patient”.
4. At the lowest level, the frame, the description should address the details of objects in a single frame (or series of adjacent frames). The description should answer queries like “what is in the frame(s)?”

The proposed descriptions are then assembled together to construct the framework of the ontology, as shown in Fig. 2. To present and address more details among descriptions, various descriptors are introduced for each description, as shown in Table 1. Because it does not depend on the domain information of any particular video type, the proposed ontology can be utilized to describe most videos in our daily life. However, for certain kinds of videos (e.g. the medical video), the knowledge from domain experts would be necessary for content management and annotation. Hence, we adopt many extendable descriptors in the ontology, which are specified by the



Table 1. Definition of descriptors in video content description ontology

Description  Descriptor           Definition
VD           Video category       Specify video category information
VD           Speciality category  Specify taxonomy information in a specific video domain
GD           Event                Specify the event information in current group
GD           Actor                Specify the actor of the current event
SD           Object               Specify the object(s) in the shot
SD           Action               Specify the action of the object(s)
SD           Location             Specify the location of the action
FD           Object               Specify the object(s) in the current frame (or series of adjacent frames)
FD           Status               Specify the condition of the object(s)

domain expert or system manager, to address details of each video type. To determine descriptors and their instances for any given video type, the domain expert lists all possible interesting objects (or classes) among the videos. The aggregation of all objects (or classes) finally determines the descriptors and their instances. In our demonstration system [], an expert from a medical school was invited to help us construct the ontology descriptors and their instances, as shown in Tables 1–4. If domain experts are not available, we may utilize information from the Internet; for example, Yahoo has a hierarchy of 1,500,000 categories, which might be very helpful in creating an ontology.

The instances of each descriptor are predefined, as shown in Tables 2, 3, and 4. In Table 2, we first classify the video into various domains, such as news program, medical video, etc. Then, given a specific video domain, the specialty category is used to classify the video into specific directories; as shown in Table 3, we classify medical videos into categories such as ophthalmology, immunology, etc. An example of the predefined event descriptor for medical video is given in Table 4. Moreover, the instances of each descriptor remain extensible. While annotating, the annotator may first browse the instances of a given descriptor; if no keyword is suitable for the current description, one or more instances may be added. The advantages of using a content description ontology are outlined below:
1. It supplies a dynamic and efficient video content description scheme; the ontology can be extended easily by adding more descriptors.
2. The ontology can be separated from the annotation data. An annotator’s modifications to the instances of each descriptor can be shared with other annotators. This enhances the reusability of the description data.
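The extensible, shared ontology described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class names `Descriptor` and `Ontology`, the method names, and the choice of keying descriptors by (description, descriptor name) are all assumptions made for illustration. Only the description abbreviations and the example instances come from Tables 1, 2 and 4.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Descriptor:
    """One descriptor of the ontology (hypothetical type), e.g. the GD descriptor "Event"."""
    name: str
    description: str                      # one of "VD", "GD", "SD", "FD"
    instances: List[str] = field(default_factory=list)

    def add_instance(self, keyword: str) -> bool:
        """Add a keyword instance unless it already exists; return True if added."""
        if keyword in self.instances:
            return False
        self.instances.append(keyword)
        return True


@dataclass
class Ontology:
    """Shared ontology, kept separate from any annotation data (hypothetical type)."""
    descriptors: Dict[Tuple[str, str], Descriptor] = field(default_factory=dict)

    def define(self, description: str, name: str, instances: List[str]) -> Descriptor:
        # Keyed by (description, name) because "Object" appears at two levels in Table 1.
        descriptor = Descriptor(name, description, list(instances))
        self.descriptors[(description, name)] = descriptor
        return descriptor

    def get(self, description: str, name: str) -> Descriptor:
        return self.descriptors[(description, name)]


# Predefined instances taken from Tables 2 and 4 (abbreviated).
ontology = Ontology()
ontology.define("VD", "Video category", ["News program", "Medical video", "Sports video"])
ontology.define("GD", "Event", ["Presentation", "Surgery", "Dialog", "Diagnose"])

# An annotator browses the instances first and adds a keyword only if it is missing.
event = ontology.get("GD", "Event")
added_first = event.add_instance("Experiment")   # not yet present, so it is added
added_again = event.add_instance("Experiment")   # already present, so nothing changes
```

Because annotations would refer to descriptors and keywords rather than copy them, an instance added by one annotator becomes visible to every annotator who shares the same `Ontology` object, which is the reusability argument made in the list above.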
Note that, since adjacent frames in one shot usually contain the same semantic content, the annotator may treat a group of adjacent frames as one unit and annotate them at once.

4.3. Shot-based video temporal description data organization
To enhance the reusability of descriptive data, we should separate the ontology from the description data. That is, the description ontology is constructed, maintained and shared by

Table 2. An example of the instances for video category
Home video, News program, MTV program, Comedy, Surveillance video, Medical video, Sports video, Course lesson, Presentation video, Movie, Animal program, ...

Table 3. An example of the instances for a specialty category of medical video
Ophthalmology, Endocrinology, Organ diagram, Immunology, Radiobiology, Microbiology, Cardiology, Oncology, ...

Table 4. An example of the instances for event descriptor of medical video
Presentation, Surgery, Organ diagram, Dialog, Experiment, Unknown, Diagnose, Outdoor scene, ...

all annotators. With this ontology, various description data could be acquired from different videos. To integrate video semantics with low-level features, a shot based data structure is constructed for each video, as shown in Fig. 3. Given any video shot Si , assuming KA indicates the Keyword Aggregation (KA) of all descriptors in the ontology, then KA = {VDl ,l = 1,. . . ,NV i ; GDl ,l = 1,..NGi ; SDl , l = 1,..NS i ; FDl , l = 1,..NF i }, where VDl , GDl , SDl and FDl represent the keywords of the video description (VD), group description (GD), shot description (SD) and frame description (FD), respectively, and NV i , NGi , NS i and NF i indicate the number of keywords for each description. Moreover, to indicate the region where ID each keyword takes effect, the symbol va ?b is used to denote the region from frame a to frame b in the video with a certain identi?cation (ID). The temporal description data (TDD) for video shot Si is then de?ned as the aggregation of mappings between annotation ontology and temporal frames.
$$TDD = \{S_i^{ID}, S_i^{ST}, S_i^{ED}, Map(KA, V)\} \tag{1}$$

where $S_i^{ID}$ specifies the identification (ID) of the current shot $S_i$; $S_i^{ST}$ and $S_i^{ED}$ denote the start and end frames of $S_i$, respectively; KA indicates the keyword aggregation of all descriptors; V indicates a set of video streams, $v^{ID}_{a-b} \in V$, $ID = 1, \ldots, n$; and Map() defines the correspondence between annotations and video temporal information. $Map(KA_i; v^{ID}_{a-b})$ denotes the mapping from keyword $KA_i$ to the region from frame a to frame b in the video with identification ID. For instance, the mapping $Map(FD_l; v^{2}_{100-200})$ defines a one-to-one mapping between an FD descriptor keyword $FD_l$ and $v^{2}_{100-200}$, where the identification of the video is ID = 2 and the frame region of the mapping is from frame 100 to frame 200. We can also have a many-to-one mapping. For example, the mapping $Map(FD_k, SD_l; v^{2}_{100-300})$ defines a many-to-one relationship indicating that both keywords $FD_k$ and $SD_l$ are specified in the region from frame 100 to frame 300 in the video with ID = 2. Similarly, many-to-many and one-to-many relationships can be defined, as shown in Eq. 2:

$$Map(FD_k, SD_l; v^{2}_{100-800}, v^{2}_{1200-1400}); \quad Map(GD_k; v^{2}_{200-700}, v^{2}_{1300-2300}) \tag{2}$$
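The shot based structure of Eqs. 1 and 2 can be pictured with a small sketch. This is only an illustration under assumed names: the classes `Region`, `Mapping` and `TDD`, the method `keywords_at`, and the sample frame numbers are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """Frame region v^{ID}_{a-b}: frames a..b of the video with the given ID."""
    video_id: int
    start: int
    end: int


@dataclass
class Mapping:
    """Map(keywords; regions): every keyword applies to every listed region."""
    keywords: tuple
    regions: tuple


@dataclass
class TDD:
    """Temporal description data of one shot (Eq. 1)."""
    shot_id: int
    start_frame: int
    end_frame: int
    mappings: list = field(default_factory=list)

    def keywords_at(self, video_id, frame):
        """Return the set of keywords in effect at the given frame."""
        found = set()
        for m in self.mappings:
            if any(r.video_id == video_id and r.start <= frame <= r.end
                   for r in m.regions):
                found.update(m.keywords)
        return found


# Hypothetical shot with one many-to-many and one one-to-one mapping.
shot = TDD(shot_id=7, start_frame=100, end_frame=1400)
shot.mappings.append(
    Mapping(("FD_k", "SD_l"), (Region(2, 100, 800), Region(2, 1200, 1400))))
shot.mappings.append(Mapping(("GD_k",), (Region(2, 200, 700),)))
```

Because the keywords live in the mappings and not in the frames, the same shot record can carry annotations from several annotators without duplicating frame data.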

The advantage of the above mapping is that the annotation ontology is separated from the temporal description data. Hence, the same video data can be shared and annotated by different annotators for different purposes, and can be easily reused for different applications. Given one video, assembling the TDD of all the shots it contains forms its temporal description stream (TDS). This indicates that all annotation information is associated with each shot, with each shot containing VD, GD, SD and FD information. The reasons we utilize such a data structure are as follows:
1. A data description structure based on the single frame level will inevitably incur large redundancy.
2. Video shot boundaries can be segmented automatically [42, 43] with satisfactory results.
3. Video shots are usually taken as the basic units of video processing techniques [1, 2]; a shot based structure will help us seamlessly integrate low-level features with semantics.
4. If there is large content variance in the shot, more keywords can be used in the Frame Description to characterize the changes. Hence, the proposed structure will not lose semantic details of the video.

5. Video content annotation

Using the video content description ontology described in Sect. 4, the video content can be stored, browsed, and retrieved efficiently. But no matter how effective an annotation structure is, annotating videos frame by frame is still a time-consuming operation. In this section, a semi-automatic annotation strategy which utilizes various video processing techniques (shot segmentation, group detection, group merging, etc.) is proposed to help annotators acquire video context information for annotation. We will first address techniques for parsing video into shots and semantically related groups. Then, a video scene detection strategy is proposed. Finally, by integrating these techniques, a semi-automatic video annotation scheme is presented.

5.1. Video group detection

The simplest method to parse video data for efficient browsing, retrieval and navigation is to segment the continuous video sequence into physical shots and then select one or more key-frames for each shot to depict its content information [1, 2]. We use the same approach in our strategy. Video shots are first detected from a video using our shot detection techniques [43]. For the sake of simplicity, we choose the 10th frame of each shot as its key-frame. However, since the video shot is a physical unit, it is incapable of conveying independent semantic information. Various approaches have been proposed to determine a cluster of video shots (group or scene) that conveys relatively higher level video scenario information. Zhong et al. [21] propose a strategy which clusters visually similar shots and supplies viewers with a hierarchical structure for browsing. However, since spatial shot clustering strategies consider only the visual similarity among shots, the context information is lost. To address this problem, Rui et al. [40] present a method which merges visually similar shots into groups, then constructs a video content table by considering the temporal relationships among groups. The same approach is reported in [32]. In [44], a time-constrained shot clustering strategy is proposed to cluster temporally adjacent shots into clusters, and a Scene Transition Graph is constructed to detect the video story unit by utilizing the acquired cluster information. A temporally time-constrained shot grouping strategy has also been proposed [45]. Nevertheless, since the video scene is a semantic unit, it is difficult in some situations to determine its boundaries using only visual features, even with the human eye. Hence, the scene segmentation results of current strategies are unsatisfactory.

Compared to other strategies that emphasize grouping all semantically related shots into one scene, our method emphasizes merging temporally or spatially related shots into groups, and then offering those groups for annotation. The quality of most proposed methods depends heavily on the selection of thresholds [32, 44, 45]; however, the content and low-level features vary among different videos. Even in the same video, there may be a large variance. Hence, an adaptive threshold selection for video grouping or scene segmentation is necessary. We use the entropic threshold technique in this paper. It has been shown to be highly efficient for the two-class data classification problem.

5.1.1. Shot grouping for group detection

Video shots in the same scene have a higher probability of sharing the same background; hence they may have higher visual similarities when compared with shots which are not in the same scene. Moreover, shots in the same scene may also be organized in a temporal sequence to convey scenario information. For example, in a dialog scene the adjacent shots usually have relatively low similarity; however, similar shots might be shown back and forth to characterize different actors in the dialog. To address the correlation among shots in the same scene, a shot grouping strategy is proposed in this section to merge semantically related adjacent shots into group(s). To segment spatially or temporally related shots into groups, a given shot is compared with the shots that precede and succeed it (using no more than two shots) to determine



Fig. 3. Data structure for shot based video temporal description (the semantic description network for shot $S_i$ links the shot to its VD, GD, SD and FD keywords at the video, group, shot and frame levels)

Fig. 4. Exploring correlations among video shots for group detection (shots i-2 to i+3)

correlations between them, as shown in Fig. 4. Since closed-caption and speech information is not available in our strategy, we use visual features to determine the similarity between shots. We adopt a 256-dimensional HSV color histogram and a 10-dimensional Tamura coarseness texture as visual features. Suppose $H_{i,l}, l \in [0, 255]$ and $T_{i,l}, l \in [0, 9]$ are the normalized color histogram and texture of key-frame i. The visual similarity between shots $S_i$ and $S_j$ is defined by Eq. 3:

$$StSim(S_i, S_j) = W_C \sum_{l=0}^{255} \min(H_{i,l}, H_{j,l}) + W_T \Big(1 - \sum_{l=0}^{9} (T_{i,l} - T_{j,l})^2\Big) \tag{3}$$

where $W_C$ and $W_T$ indicate the weights of color and Tamura texture. For our system, we set $W_C = 0.7$, $W_T = 0.3$. To detect the group boundary by using the correlation among adjacent video shots, we define the following similarity distances:

$$CL_i = Max\{StSim(S_i, S_{i-1}), StSim(S_i, S_{i-2})\} \tag{4}$$
$$CR_i = Max\{StSim(S_i, S_{i+1}), StSim(S_i, S_{i+2})\} \tag{5}$$
$$CL_{i+1} = Max\{StSim(S_{i+1}, S_{i-1}), StSim(S_{i+1}, S_{i-2})\} \tag{6}$$
$$CR_{i+1} = Max\{StSim(S_{i+1}, S_{i+2}), StSim(S_{i+1}, S_{i+3})\} \tag{7}$$

Given video shot $S_i$, if it is the first shot of a new group, it will have larger correlations with shots on its right side (as shown in Fig. 4) than with shots on its left side, since we assume the shots in the same group usually have large correlations with each other. A separation factor R(i) for shot $S_i$ is defined by Eq. 8 to evaluate a potential group boundary.

$$R(i) = (CR_i + CR_{i+1}) / (CL_i + CL_{i+1}) \tag{8}$$

The shot group detection procedure then takes the following steps:
1. Given any shot $S_i$, if $CR_i$ is larger than $TH_2 - 0.1$:
   (a) If R(i) is larger than $TH_1$, claim that a new group starts at shot $S_i$.
   (b) Otherwise, go to step 1 to process other shots.
2. Otherwise:
   (a) If both $CR_i$ and $CL_i$ are smaller than $TH_2$, claim that a new group starts at shot $S_i$.
   (b) Otherwise, go to step 1 to process other shots.
3. Iteratively execute steps 1 and 2 until all shots are parsed successfully.

As the first shot of a new group, both $CR_i$ and R(i) of shot $S_i$ are generally larger than the predefined thresholds. Step 1 is proposed to handle this situation. Moreover, there may be a shot that is dissimilar to the groups on both of its sides and acts as a group separator by itself (like the anchor person in a news program). Step 2 is used to detect such boundaries. Using this strategy, two kinds of shots are absorbed into a given group:
- Shots related in temporal series, such as a dialog or presentation, where similar shots are shown back and forth. Shots in this group are referred to as temporally related. Examples of temporally related shots are shown in rows 1 and 2 of Fig. 5, where adjacent shots have relatively low similarity but similar shots are interlaced within one group.
- Shots related in visual similarities, where all shots in the group are visually similar. Shots in this group are referred



Fig. 5. Group detection results, with group rows from the top to bottom identifying (in order): presentation, dialog, surgery, diagnosis and diagnosis

to as spatially related. Examples of spatially related shots are shown in row 3 of Fig. 5, where adjacent shots are very similar to each other.

Accordingly, given a detected group $G_i$, we assign it to one of two categories: temporally related or spatially related. Assuming there are W shots ($S_i$, i = 1, ..., W) contained in $G_i$, the group classification strategy takes the following steps.

Input: Video group $G_i$, and shots $S_i$ (i = 1, ..., W) in $G_i$.
Output: Clusters ($C_{N_c}$, $N_c$ = 1, ..., U) of shots in $G_i$.
Procedure:
1. Initially, set variable $N_c$ = 1; cluster $C_{N_c}$ has no members.
2. Select the shot $S_k$ in $G_i$ with the smallest shot number as the seed for cluster $C_{N_c}$, and subtract $S_k$ from $G_i$. If there are no more shots contained in $G_i$, go to step 5.
3. Calculate the similarity between $S_k$ and the other shots $S_j$ in $G_i$. If $StSim(S_k, S_j)$ is larger than threshold $T_h$, absorb shot $S_j$ into cluster $C_{N_c}$ and subtract $S_j$ from $G_i$.
4. Iteratively execute step 3 until there are no more shots that can be absorbed into the current cluster $C_{N_c}$. Increase $N_c$ by 1 and go to step 2.
5. If $N_c$ is larger than 1, we claim $G_i$ is a temporally related group; otherwise, it is a spatially related group.

Remark: In this paper, the video group and scene are defined similarly to [40]: (1) A video group is an intermediate entity between the physical shots and semantic scenes. Examples of groups are temporally related shots or spatially related shots. (2) A video scene is a collection of semantically related and temporally adjacent groups, depicting and conveying a high-level concept or story. A video scene usually consists of one or more video groups.

5.1.2. Automatic threshold detection

As stated in the above section, the thresholds $TH_1$ and $TH_2$ are the key values for obtaining good results. An entropic threshold technique is used in this section to select the optimal thresholds for these two factors. A fast entropy calculation method is also presented.
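The group detection and classification procedures of Sect. 5.1.1, together with a direct (non-recursive) form of the entropic threshold search that the remainder of this section formalizes, might be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, out-of-range neighbours are assumed to have similarity 0, the first shot is assumed to start a group, the texture term follows Eq. 3 as it is printed here, and the values of R(i) are assumed to be already quantized into histogram bins f[0..M].

```python
import numpy as np

W_C, W_T = 0.7, 0.3  # colour and texture weights given in the text


def st_sim(h_i, t_i, h_j, t_j):
    """Eq. 3 as printed: histogram intersection plus a texture term."""
    color = np.minimum(h_i, h_j).sum()
    texture = 1.0 - np.sum((np.asarray(t_i) - np.asarray(t_j)) ** 2)
    return W_C * color + W_T * texture


def group_starts(sim, th1, th2):
    """Indices of shots that start a new group; sim[i][j] = StSim(S_i, S_j)."""
    n = len(sim)

    def s(a, b):  # assumption: missing neighbours contribute similarity 0
        return sim[a][b] if 0 <= a < n and 0 <= b < n else 0.0

    starts = [0]  # assumption: the first shot always opens a group
    for i in range(1, n):
        cl_i = max(s(i, i - 1), s(i, i - 2))          # Eq. 4
        cr_i = max(s(i, i + 1), s(i, i + 2))          # Eq. 5
        cl_n = max(s(i + 1, i - 1), s(i + 1, i - 2))  # Eq. 6
        cr_n = max(s(i + 1, i + 2), s(i + 1, i + 3))  # Eq. 7
        r = (cr_i + cr_n) / max(cl_i + cl_n, 1e-9)    # Eq. 8, guarded
        if cr_i > th2 - 0.1:
            if r > th1:                      # step 1(a)
                starts.append(i)
        elif cr_i < th2 and cl_i < th2:      # step 2(a): separator shot
            starts.append(i)
    return starts


def classify_group(shots, sim, th):
    """Greedy seed clustering; more than one cluster means temporally related."""
    remaining = sorted(shots)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        cluster = [seed] + [j for j in remaining if sim[seed][j] > th]
        remaining = [j for j in remaining if j not in cluster]
        clusters.append(cluster)
    return ("temporal" if len(clusters) > 1 else "spatial"), clusters


def entropic_threshold(f):
    """Pick T maximising H_n(T) + H_e(T) over histogram f[0..M] (direct form)."""
    f = np.asarray(f, dtype=float)

    def entropy(part):
        p = part[part > 0] / part.sum()
        return float(-(p * np.log(p)).sum())

    best_t, best_h = None, -np.inf
    for t in range(len(f) - 1):
        lo, hi = f[:t + 1], f[t + 1:]
        if lo.sum() == 0 or hi.sum() == 0:
            continue  # one class would be empty
        h = entropy(lo) + entropy(hi)
        if h > best_h:
            best_t, best_h = t, h
    return best_t
```

The classification counts the clusters that were formed, which is the intent of step 5; the direct threshold search recomputes both entropies for every candidate T and therefore does not have the reduced cost of the recursive scheme.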
To illustrate, assume the maximal difference of R(i) in Eq. 8 is in the range [0, M]. In an input MPEG video, assume there are $f_i$ shots whose R(i) has the value i, $i \in [0, M]$. Given a threshold, say T, the probability distributions for the group-boundary and non-group-boundary shots can be defined. As they are regarded as independent distributions, the probability for the non-group-boundary shots $P_n(i)$ can be defined as:

$$P_n(i) = f_i \Big/ \sum_{h=0}^{T} f_h, \quad 0 \le i \le T \tag{9}$$

where $\sum_{h=0}^{T} f_h$ gives the total number of shots with ratio R(i) in the range $0 \le i \le T$. The probability for the group-boundary shots $P_e(i)$ can be defined as:

$$P_e(i) = f_i \Big/ \sum_{h=T+1}^{M} f_h, \quad T + 1 \le i \le M \tag{10}$$

where $\sum_{h=T+1}^{M} f_h$ is the total number of shots with ratio R(i) in the range $T + 1 \le i \le M$. The entropies for these two classes, group-boundary shots and non-group-boundary shots, are then given by:

$$H_n(T) = -\sum_{i=0}^{T} P_n(i) \log P_n(i); \quad H_e(T) = -\sum_{i=T+1}^{M} P_e(i) \log P_e(i) \tag{11}$$

The optimal threshold vector $T_C$ has to satisfy the following criterion function [46]:

$$H(T_c) = \max_{T = 0 \ldots M} \{H_n(T) + H_e(T)\} \tag{12}$$

To find the global maximum of Eq. 12, the computation burden is bounded by $O(M^2)$. To reduce the search burden, a fast search algorithm is proposed to exploit the recursive iterations for calculating the probabilities $P_n(i)$, $P_e(i)$



and the entropies $H_n(T)$, $H_e(T)$, where the computational burden is introduced by calculating the re-normalized part repeatedly. We first define the total number of the pairs in the non-group-boundary and group-boundary classes (the re-normalized parts used in Eqs. 9 and 10) when the threshold is set to T:

$$P_0(T) = \sum_{h=0}^{T} f_h; \quad P_1(T) = \sum_{h=T+1}^{M} f_h \tag{13}$$

The corresponding total number of pairs at global threshold T + 1 can be calculated as:

$$
\begin{aligned}
P_0(T+1) &= \sum_{h=0}^{T+1} f_h = \sum_{h=0}^{T} f_h + f_{T+1} = P_0(T) + f_{T+1} \\
P_1(T+1) &= \sum_{h=T+2}^{M} f_h = \sum_{h=T+1}^{M} f_h - f_{T+1} = P_1(T) - f_{T+1}
\end{aligned} \tag{14}
$$

The recursive iteration property of the two corresponding entropies can then be exploited by Eq. 15:

$$
\begin{aligned}
H_n(T+1) &= -\sum_{i=0}^{T+1} \frac{f_i}{P_0(T+1)} \log \frac{f_i}{P_0(T+1)} \\
&= -\sum_{i=0}^{T+1} \frac{P_0(T)}{P_0(T+1)} \frac{f_i}{P_0(T)} \log\left(\frac{f_i}{P_0(T)} \cdot \frac{P_0(T)}{P_0(T+1)}\right) \\
&= \frac{P_0(T)}{P_0(T+1)} H_n(T) - \frac{f_{T+1}}{P_0(T+1)} \log \frac{f_{T+1}}{P_0(T+1)} - \frac{P_0(T)}{P_0(T+1)} \log \frac{P_0(T)}{P_0(T+1)} \\
H_e(T+1) &= -\sum_{i=T+2}^{M} \frac{f_i}{P_1(T+1)} \log \frac{f_i}{P_1(T+1)} \\
&= -\sum_{i=T+2}^{M} \frac{P_1(T)}{P_1(T+1)} \frac{f_i}{P_1(T)} \log\left(\frac{f_i}{P_1(T)} \cdot \frac{P_1(T)}{P_1(T+1)}\right) \\
&= \frac{P_1(T)}{P_1(T+1)} H_e(T) + \frac{f_{T+1}}{P_1(T+1)} \log \frac{f_{T+1}}{P_1(T+1)} - \frac{P_1(T)}{P_1(T+1)} \log \frac{P_1(T)}{P_1(T+1)}
\end{aligned} \tag{15}
$$

The recursive iteration is reduced by adding only the incremental part, and the search burden is reduced to O(M). The same strategy can be applied to find the optimal threshold for $TH_2$. Assume the optimal thresholds for $CR_i$ and $CL_i$ are detected as TLR and TLL, respectively; then $TH_2$ is computed as $TH_2 = Min(TLR, TLL)$.

Figure 5 presents experimental results of our group detection strategy. As it demonstrates, video shots in one scene are semantically related, and some of the shots share the same background or exhibit the same dominant color; this operation helps in merging the shots in each scene into one or several groups. In Sect. 5.3, these groups will help the annotator acquire video context information and annotate video effectively.

5.2. Joint semantics and visual similarity for scene detection

Using the procedure described above, video shots can be parsed into semantically related groups. The semi-automatic annotation strategy in Sect. 5.3 then uses these groups to help the annotator determine the video context and semantics for annotation. After video groups have been annotated, semantics and low-level features are both available for each group. Thus, a group similarity assessment using joint semantic and visual features can be used to merge semantically related adjacent groups into scenes. With this detected scene structure, annotators can visually evaluate their annotation results. Since a group consists of spatially or temporally related shots, the similarity between groups should be based on the similarity between shots.

5.2.1. Semantic similarity evaluation between shots

In Sect. 4, we specified that the mapping of each keyword records the frame region where this keyword takes effect. To evaluate the semantic similarity between shots, this region should also be considered, since the region information determines the importance of the keyword in describing the shot content. For Video Description, Group Description and Shot Description, the keywords at these levels have a longer (or equal) duration than the current shot. Hence, they will be in effect over the entire shot. However, descriptors in the Frame Description may last only one or several frames in the shot. To calculate the semantic similarity between shots, the Effect Factor of each FD descriptor's keyword should therefore be calculated first.

Assume $FD_k$ denotes the kth keyword of FD. Given shot $S_l$, suppose there are N mappings associated with $FD_k$ in shot $S_l$, and their mapping regions are $V^{ID}_{a_1-b_1}, \ldots, V^{ID}_{a_N-b_N}$. Given any two regions $V^{ID}_{a_i-b_i}$, $V^{ID}_{a_j-b_j}$ ($i \neq j$; $i, j \in \{1, \ldots, N\}$) in these mappings, Fig. 6 shows two types of relationship between frame regions $(a_i, b_i)$ and $(a_j, b_j)$ in shot $S_l$: with or without temporal overlap. Assume the operator $\Theta(X, Y)$ denotes the number of overlapped frames between X and Y; then $\Theta(V^{ID}_{a_i-b_i}, V^{ID}_{a_j-b_j})$ in Fig. 6 is given by Eq. 16:

$$
\Theta(V^{ID}_{a_i-b_i}, V^{ID}_{a_j-b_j}) =
\begin{cases}
0, & (a_i, b_i) \text{ and } (a_j, b_j) \text{ have no overlap} \\
b_i - a_j, & (a_i, b_i) \text{ and } (a_j, b_j) \text{ have overlap}
\end{cases} \tag{16}
$$

Hence, the Effect Factor of keyword $FD_k$ corresponding to shot $S_l$ is defined by Eq. 17:

$$
EF(FD_k, S_l) = \frac{\sum_{m=1}^{N} (b_m - a_m) - \sum_{m=1}^{N-1} \sum_{n=m+1}^{N} \Theta(V^{ID}_{a_m-b_m}, V^{ID}_{a_n-b_n})}{S_l^{ED} - S_l^{ST}}, \quad m, n \in \{1, \ldots, N\} \tag{17}
$$
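Eqs. 16 and 17 can be illustrated with a short sketch. The function names are hypothetical, and the overlap is written in a form that works for any ordering of the two regions, which reduces to $b_i - a_j$ in the overlapping case of Fig. 6; that generalization is an assumption made for the sketch.

```python
def overlap(r1, r2):
    """Theta of Eq. 16 for two (start, end) frame regions, in any order."""
    (a1, b1), (a2, b2) = r1, r2
    return max(0, min(b1, b2) - max(a1, a2))


def effect_factor(regions, shot_start, shot_end):
    """Eq. 17: frames covered by a keyword, minus pairwise overlaps,
    divided by the length of the shot."""
    covered = sum(b - a for a, b in regions)
    repeated = sum(overlap(regions[m], regions[n])
                   for m in range(len(regions))
                   for n in range(m + 1, len(regions)))
    return (covered - repeated) / (shot_end - shot_start)


# Hypothetical keyword mapped to frames 100-200 and 150-300 of a shot
# spanning frames 100-500: 250 mapped frames, 50 of them counted twice.
ef = effect_factor([(100, 200), (150, 300)], 100, 500)
```

Subtracting only pairwise overlaps counts each frame once as long as no frame is covered by three or more regions of the same keyword.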





Fig. 6a,b. Frame region relationship between $(a_i, b_i)$ and $(a_j, b_j)$ in shot $S_l$. a Without temporal overlap, b with temporal overlap
In Eq. 17, $V^{ID}_{a_1-b_1}, \ldots, V^{ID}_{a_N-b_N}$ are the mapping regions associated with $FD_k$, and $a_m$ and $b_m$ denote the start and end frame of each mapping region. In fact, Eq. 17 indicates that the effect factor of keyword $FD_k$ is the ratio of the number of non-repeated frames mapped to $FD_k$ to the number of frames in shot $S_l$. It is obvious that $EF(FD_k, S_l)$ is normalized in [0, 1]. The larger the value, the more important the keyword is in addressing the semantic content of $S_l$.

To evaluate the cross-intersection between keywords at various levels, we define $VDS_k$, $GDS_k$, $SDS_k$, $FDS_k$ as the aggregation of keywords which have been used to annotate shot $S_k$ in VD, GD, SD and FD, respectively. That is, $GDS_k$ denotes the keyword aggregation of all descriptors in GD which have been used in $S_k$, and so on. To describe the relationship among a series of keywords $(X_1, X_2, \ldots, X_N)$, three operators $\{\cup(X_1, X_2, \ldots, X_N), \cap(X_1, X_2, \ldots, X_N), \Psi(X)\}$ are defined:
1. $\cup(X_1, X_2, \ldots, X_N) = \{X_1 \cup X_2 \cup \ldots \cup X_N\}$ indicates the union of $X_1, X_2, \ldots, X_N$.
2. $\cap(X_1, X_2, \ldots, X_N) = \{X_1 \cap X_2 \cap \ldots \cap X_N\}$ is the intersection of $X_1, X_2, \ldots, X_N$.
3. $\Psi(X)$ represents the number of keywords in X.

Given any two shots $S_i$ and $S_j$, assume their temporal description data (TDD) are $TDD_i = \{S_i^{ID}, S_i^{ST}, S_i^{ED}, Map(KA, V)\}$ and $TDD_j = \{S_j^{ID}, S_j^{ST}, S_j^{ED}, Map(KA, V)\}$, respectively. Assume also that $KAS_i$ denotes the union of all keywords that have been used in annotating shot $S_i$; then $KAS_i = \cup(VDS_i, GDS_i, SDS_i, FDS_i)$ and $KAS_j = \cup(VDS_j, GDS_j, SDS_j, FDS_j)$. The semantic similarity between shots $S_i$ and $S_j$ can be evaluated using Eq. 18:

$$
\begin{aligned}
SemShotSim(S_i, S_j) ={}& W_V \frac{\Psi(\cap(VDS_i, VDS_j))}{\Psi(\cup(VDS_i, VDS_j))} + W_G \frac{\Psi(\cap(GDS_i, GDS_j))}{\Psi(\cup(GDS_i, GDS_j))} + W_S \frac{\Psi(\cap(SDS_i, SDS_j))}{\Psi(\cup(SDS_i, SDS_j))} \\
&+ W_F \frac{\sum_{FD_k \in \cap(FDS_i, FDS_j)} \{EF(FD_k, S_i) \cdot EF(FD_k, S_j)\}}{\Psi(\cup(FDS_i, FDS_j))}
\end{aligned} \tag{18}
$$

From Eq. 18, we can see that the semantic similarity between shots $S_i$ and $S_j$ is the weighted sum of the cross-intersection of keywords at various levels. From VD to FD, the keywords address more and more detailed information in the shot. Therefore, in our system we set the weights of the various levels $(W_V, W_G, W_S, W_F)$ to 0.4, 0.3, 0.2 and 0.1, respectively. That is, the higher the level, the more important the keyword is in addressing content.

5.2.2. Unified similarity evaluation joining semantics and visual similarity

With the semantic similarity between $S_i$ and $S_j$, the unified similarity which joins visual features and semantics is given by Eq. 19:

$$ShotSim(S_i, S_j) = (1 - \alpha) \cdot StSim(S_i, S_j) + \alpha \cdot SemShotSim(S_i, S_j) \tag{19}$$

where $StSim(S_i, S_j)$ indicates the visual similarity specified in Eq. 3, and $\alpha \in [0, 1]$ is the weight of semantics in the similarity measurement, which can be specified by users. The larger the $\alpha$, the greater the importance given to the semantics in the overall similarity assessment. If $\alpha = 0$, we use only visual features to evaluate the similarity between $S_i$ and $S_j$. Based on Eq. 19, given a shot $S_i$ and a video group $G_j$, the similarity between them can be calculated using Eq. 20:

$$StGpSim(S_i, G_j) = \max_{S_l \in G_j} \{ShotSim(S_i, S_l)\} \tag{20}$$

This indicates that the similarity between shot $S_i$ and group $G_j$ is the similarity between $S_i$ and its most similar shot in $G_j$. In general, when comparing two groups with the human eye, we usually take the group with fewer shots as the benchmark, and then determine whether any shot in the second group is similar to certain shots in the benchmark group. If most shots in the two groups are similar enough, we consider the groups to be similar. Accordingly, given groups $G_i$ and $G_j$, assume $\hat{G}_{i,j}$ is the benchmark group and $\tilde{G}_{i,j}$ is the other group. Suppose M(X) denotes the number of shots in group X; then the similarity between $G_i$ and $G_j$ is given by Eq. 21:

$$GroupSim(G_i, G_j) = \frac{1}{M(\hat{G}_{i,j})} \sum_{l=1;\, S_l \in \hat{G}_{i,j}}^{M(\hat{G}_{i,j})} StGpSim(S_l, \tilde{G}_{i,j}) \tag{21}$$

That is, the similarity between $G_i$ and $G_j$ is the average similarity between shots in the benchmark group and the other group.

5.2.3. Scene detection

As we defined in Sect. 5.1, a video scene conveys a high-level concept or story and usually consists of one or more semantically related adjacent groups. For annotating, we ignore the scene level description, since automatic scene detection with satisfactory results is not yet available. After most



video groups have been annotated, we can integrate semantics and visual features among adjacent groups to merge similar groups into semantically related units (scenes). The constructed scenes will help the annotator visualize and refine the annotation results. To attain this goal, all neighboring groups with significant similarity in semantic and visual features are merged using the strategy below:
1. Given any group $G_i$, assume $GDE_i$ denotes the keyword aggregation of GD's event descriptors which have been used in all shots of $G_i$.
2. For any neighboring groups $G_i$ and $G_j$, if $\cap(GDE_i, GDE_j) = \emptyset$, these two groups are not merged. Otherwise, go to step 3. That is, if the event descriptor keywords of the two groups are totally different, they cannot be merged into one group.
3. Use Eq. 21 to calculate the overall similarity between these two groups; go to step 2 to find all other neighboring groups' similarities. Then go to step 4.
4. All neighboring groups with similarities larger than threshold $TH_3$ ($TH_3 = TH_2 - 0.2$) are merged into one group. As a relatively special situation, if there are more than two sequentially neighboring groups, e.g. A, B, C, with similarities GroupSim(A, B) and GroupSim(B, C) both larger than threshold $TH_3$, all of them are merged into a new group (scene).

Clearly, groups in one scene should have higher correlations in semantic and visual features. By integrating semantic and visual similarity, they will have a higher probability of being merged into one scene.

5.3. Semi-automatic video annotation

As we stated above, a sequential annotation strategy is a time-consuming and burdensome operation. Instead of annotating the video sequentially, we can utilize the results of the video processing techniques above to help annotators determine the video context and semantics.
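The similarity measures of Sect. 5.2 and the merging rule of Sect. 5.2.3, on which the annotation scheme relies, might be sketched as follows. All names and the annotation layout (a dictionary of keyword sets per level, with FD keywords mapped to their effect factors) are assumptions made for illustration; a level with no keywords in either shot is assumed to contribute 0, a case the equations leave undefined.

```python
W = {"VD": 0.4, "GD": 0.3, "SD": 0.2, "FD": 0.1}  # level weights from the text


def jaccard(x, y):
    """Psi(intersection) / Psi(union); 0 for two empty sets (assumption)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def sem_shot_sim(ann_i, ann_j):
    """Eq. 18. ann = {"VD": set, "GD": set, "SD": set, "FD": {keyword: EF}}."""
    score = sum(W[lv] * jaccard(ann_i[lv], ann_j[lv])
                for lv in ("VD", "GD", "SD"))
    fd_i, fd_j = ann_i["FD"], ann_j["FD"]
    union = set(fd_i) | set(fd_j)
    if union:
        shared = set(fd_i) & set(fd_j)
        score += W["FD"] * sum(fd_i[k] * fd_j[k] for k in shared) / len(union)
    return score


def shot_sim(visual, semantic, alpha):
    """Eq. 19: blend of visual and semantic similarity."""
    return (1 - alpha) * visual + alpha * semantic


def group_sim(g_i, g_j, sim):
    """Eqs. 20-21: average best-match similarity, smaller group as benchmark."""
    bench, other = (g_i, g_j) if len(g_i) <= len(g_j) else (g_j, g_i)
    return sum(max(sim(s, t) for t in other) for s in bench) / len(bench)


def merge_scenes(groups, events, sim, th3):
    """Chain neighbouring groups that share an event keyword and exceed th3."""
    scenes = [[0]]
    for k in range(1, len(groups)):
        linked = (bool(events[k - 1] & events[k])
                  and group_sim(groups[k - 1], groups[k], sim) > th3)
        if linked:
            scenes[-1].append(k)
        else:
            scenes.append([k])
    return scenes


# Hypothetical annotations of two shots from the same surgery group.
a = {"VD": {"medical"}, "GD": {"surgery"}, "SD": {"doctor"}, "FD": {"scalpel": 0.5}}
b = {"VD": {"medical"}, "GD": {"surgery"}, "SD": {"nurse"}, "FD": {"scalpel": 0.5}}
semantic = sem_shot_sim(a, b)  # 0.4 + 0.3 + 0 + 0.1 * 0.25
```

Here `sim` is any callable returning ShotSim for a pair of shot identifiers, so the same merging code can be run with different values of $\alpha$.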
Some semi-automatic annotation schemes have been implemented in image databases [47, 48], using semantics, visual features and relevance feedback [49] to assist the annotator in finding certain types of images for annotation. Based on these schemes, a semi-automatic annotation scheme for video databases is presented in this section; the main flow is shown in Fig. 1.

5.3.1. Semi-automatic annotation with relevance feedback

As the first step, the shot grouping method is adopted to segment spatially or temporally related shots into groups. (Since the video semantics are not available at this stage, our group detection strategy uses only low-level features.) Then, the groups are shown sequentially in the interface for annotation, as shown in Fig. 7. Given any group in the interface, the annotator has three options:
1. Annotate a certain shot by double clicking the key-frame of the shot (the result is illustrated in Fig. 8). The annotator can assign both shot description and frame description keywords related to the shot (and frames in the shot). A series of function buttons such as play, pause, etc. are available to help the annotator browse video shots and determine the semantics of the shot and frames.
2. If the annotator thinks that the current group belongs to the same event category, he or she can assign group description and video description keyword(s) to the group by clicking the hand-like icon at the left of the group and selecting the corresponding keywords.
3. If the annotator thinks the current group contains multiple events, he or she can manually separate it into different groups (with each group belonging to only one event category) by dragging the mouse to mask the shots in the same event category and then clicking the hand-like icon to assign keywords. By doing this, the current group is separated into several groups, which are shown separately in the interface.

As described in Sect. 4, since we use a hierarchical content description ontology and a shot based annotation data structure, all units at the lower level inherit keywords from the level(s) above. For example, if we assign a group description keyword to a group, all shots in this group are annotated with this keyword. At any annotation stage, the annotator can select one shot or a group of shots as the query to find similar groups for annotation. Then, the relevance feedback (RF) strategy is activated to facilitate this operation:
1. All selected shot(s) are treated as a video group. The annotator should assign keywords to annotate them before the retrieval.
2. After the annotator clicks the "find" button, the similarity evaluation strategy in Eq. 21 is used to find similar groups.
3. After the result has been retrieved, the annotator can either annotate those similar groups separately, or mark the result (or part of it) as feedback examples and click the "RF" button to trigger RF processing. By selecting RF, all selected shots are annotated with the keywords the annotator specified in step 1. Then Eq. 22 is used to retrieve other similar groups.
By recursively executing the relevance feedback iterations above, more and more video groups can be annotated. In this situation, the system works like a video retrieval system. However, there is one difference: the retrieved groups are not shown sequentially from top to bottom according to their similarity scores. Instead, they remain located at their original positions, because the annotator needs the preceding or succeeding groups to provide the context information for annotation. The similarity score is displayed at the left of each group. To save interactive space, those groups with a large distance to the current feedback iteration (determined by the annotator's specification of the number of groups to be returned) are displayed as a small symbol in the interface. A double click on the symbol displays all shots in the group on the screen. Equation 22 presents the simplified RF model in our system (based on the Bayesian formula [50]). There are no negative feedback examples in our system; all selected shots are treated as positive feedback examples. Assuming $G_i$ denotes the selected feedback examples in the current iteration, for any group $G_j$ in the database, its global similarity $Sim(j)^k$ in the current iteration (k) is determined by its global similarity in the previous iteration $Sim(j)^{k-1}$ and its similarity to the currently selected feedback examples $GroupSim(G_i, G_j)$. $\eta$ is an operator that



Fig. 7. Hierarchical video content annotation interface

Fig. 8. Shot and frame annotation interface

indicates the influence of the history on the current evaluation. In our system we set $\eta = 0.3$.

$$Sim(j)^k = \eta\, Sim(j)^{k-1} + (1 - \eta)\, GroupSim(G_i, G_j) \tag{22}$$

5.3.2. Annotation results visualization and refinement

In the content description ontology, we ignore the scene level description, since automatic scene detection is not yet available. However, by integrating the annotated semantics related to groups, we can merge semantically related adjacent groups into scenes and present the annotation results to the viewers. The annotator can accordingly evaluate and refine the annotations more efficiently.
1. At any annotation stage, the annotator can click the "refine" button (shown in Fig. 7). Then, the scene detection strategy (in Sect. 5.2) is invoked to merge adjacent similar groups into scenes.
2. The annotator can specify different values for $\alpha$ to modify the contribution of the semantics to the similarity assessment and to evaluate the annotation quality in different situations.

By doing this, the annotator can visually browse the video content structure and evaluate the quality of the annotation. After that, the annotator can terminate or resume the operation at any time, i.e. a series of annotation, refinement, annotation steps can be executed repeatedly until a satisfactory result is achieved. Using these strategies, a more reliable and efficient video content annotation is achieved. It is better than the manual approach in terms of efficiency, and better than an automatic scheme in terms of accuracy:
1. A hierarchical video content description ontology is utilized in the system, which addresses the video content in



Fig. 9. Diagram of hierarchical video summarization (video groups feed the layer 1 shot level summary; group merging gives the layer 2 group level summary; scene clustering gives the layer 3 scene level summary; event category selection gives the layer 4 video level summary; representative group, shot and frame selection are applied along the way)

Fig. 10. Simplified video scenario and content presentation model (presenting topics, showing evidences, drawing conclusion)

different granularity. The categories, events and detailed information about the video are presented at different levels. It can minimize the influence of the annotator's subjectivity, and enhance the reusability of various annotators' descriptive keywords.
2. A semi-automatic annotation strategy is integrated, which utilizes various video processing techniques (video shot and group detection, joint semantic and visual similarity for scene detection, relevance feedback) to help annotators acquire video context information and annotate video more efficiently. Moreover, the annotated video content structure can be visualized directly.
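The relevance feedback update of Eq. 22 used by this annotation strategy can be illustrated with a few lines. The function name, the dictionary layout and the choice of 0 as the score of a group in the first iteration are assumptions made for the sketch.

```python
def update_similarity(previous, current, eta=0.3):
    """Eq. 22: blend each group's previous global score with its GroupSim
    to the feedback examples selected in the current iteration."""
    return {g: eta * previous.get(g, 0.0) + (1 - eta) * current[g]
            for g in current}


# Hypothetical scores for two groups after one earlier iteration.
scores = update_similarity({"G1": 0.5, "G2": 0.2}, {"G1": 0.9, "G2": 0.1})
```

With $\eta = 0.3$ the newest feedback dominates, while groups that scored well in earlier iterations keep part of that score.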

6. Hierarchical video summarization and skimming

Among all five layers of the video content hierarchy, it is the scene that conveys the semantic meaning of the video to the viewers, by using groups, shots and frames to address the detailed information within each story unit. Hence, the video summary should also fit this hierarchy by presenting the digest at different levels and with different granularity. Generally, video summaries built only on low-level features are either too rough to retain the video content or contain too much redundancy, since they have no access to semantics. Using the content description data acquired by the methods described in previous sections, we can address video content at various levels and different granularity, and present a meaningful video summary. In this way, the visual features and semantic information are integrated to construct a four layer summary: shot level summary, group level summary, scene level summary and video level summary. These levels correspond to the summary at each layer of the content hierarchy, and present the video digest from top to bottom in increasing granularity. The flow chart of the proposed approach is shown in Fig. 9. Note that:
1. Since groups consist of spatially or temporally related shots, they are the ideal units to summarize shots. That is, the video summary at the shot level (layer 1) consists of all groups, to uncover details within the video.
2. As we defined in Sect. 5.1, a video scene conveys a high-level concept or story, and it usually consists of one or more semantically related groups. By using the scene detection strategy in Sect. 5.2, we can combine similar groups into semantically related units (scenes) and use them as group level summaries.
3. Since similar scenes may be shown repeatedly in the video, the summary at the scene level is determined by clustering algorithms to eliminate redundant scenes in the video and present a visually and semantically compact summary.
4. In general, video scenarios can be separated into three parts: (1) presenting subject or topic information; (2) showing evidence and details; and (3) drawing conclusions. These three parts are usually shown separately at the front, middle, and back of the video. A simplified video scenario and content presentation model is shown in Fig. 10. Hence, the summary at the video level is constructed by selecting meaningful event categories from the third layer summary to fit this model and supply the viewer with the most general overview.

Figure 11 illustrates the corresponding steps to construct the summary, which are described below:

Input: Video groups, temporal description stream (TDS) of the video.
Output: A four layer video summary in pictorial format or in trimmed video stream format (skimming).
Procedure:
1. Construct the summary at layer 1 (shot level summary): Using all groups as summary candidate groups, use SelectRepShot() and SelectRepFrame() (introduced below) to select representative shots and representative frames for each candidate group. The combination of these representative frames and shots forms the video skimming and pictorial summary at the shot level.



2. Construct the summary at layer 2 (group level summary): Set α = 0.3 and use the scene detection strategy introduced in Sect. 5.2 to merge neighboring groups with higher semantic and visual similarity into scenes. Then, use SelectRepGroup() (introduced below) to select the representative group for each scene. Take all representative groups as summary candidates, use SelectRepShot() and SelectRepFrame() to select their representative shots and frames, and assemble them to form the second layer video skimming and summary. The experimental results in Sect. 7 illustrate why we set α to 0.3. In general, neighboring groups in the same scene are semantically related to each other, even if they have relatively low visual similarity. On the other hand, if they belong to different scenes, there is no correlation between their semantics. Hence, by considering both visual features and semantics, a scene structure is determined which supplies a well-organized summary of groups.
3. Construct the summary at layer 3 (scene level summary): Based on the scene detection results, SceneClustering() (introduced below) is applied to cluster all detected scenes into a hierarchical structure. Then use SelectRepGroup() to select representative groups. Their representative frames and shots are assembled as the video summary and skimming at the third layer. Since similar scenes are usually shown in the video several times, the clustering operation eliminates redundancy among them and presents a compact summary of scenes.
4. Construct the summary at layer 4 (video level summary): The event category selection for video level summary construction is executed by selecting, from each part (front, middle, back) of the video, one group for each event category, and then assembling those groups to form the summary at the highest layer. Usually, different events vary in their ability to unfold the semantic content of different types of videos. For example, the presentation and dialog events in medical videos are more important than events such as surgery, experiment, etc., since the former uncover general content information and the latter address the details. A scheme for selecting event categories to abstract medical video is proposed in [51]. As a general video summarization strategy, we merely suppose all event categories have the same importance. The following sequential selection strategy is adopted:
a. Separate the video almost equally into three parts {front, middle, and back}.
b. For each part of the video, assume there are N representative groups, $RG_1, RG_2, \ldots, RG_N$, which have been selected as the scene level summary. For any representative group $RG_i$, $GDE_i$ denotes the keyword aggregation of the event descriptors of all shots in $RG_i$, and $\bigcup(GDE_1, \ldots, GDE_N)$ denotes the union of $GDE_1, \ldots, GDE_N$. Sequentially check each representative group ($RG_1, \ldots, RG_N$); the group $RG_i$ with $GDE_i \subseteq \bigcup(GDE_1, \ldots, GDE_N)$ is selected as a summary candidate. Then, all keywords in $GDE_i$ are deleted from $\bigcup(GDE_1, \ldots, GDE_N)$. That is, only one group is selected to represent each event type.
c. Recursively execute step “b” until there is no keyword contained in $\bigcup(GDE_1, \ldots, GDE_N)$, and go to step “d”.
d. Use the same strategy to select summary candidate groups for the other two parts of the video, then use SelectRepShot() and SelectRepFrame() to select the representative shots and frames for all selected summary candidate groups. Their combination forms the highest layer video skimming and summary.
5. Organize the hierarchical video summary structure and return.

Figure 12 shows the system interface of the hierarchical video summarization. The first row shows the pictorial summary of the current layer and all other rows indicate current group information. The skimming of each layer is stored as an MPEG file on disk. In order to utilize the video content description data more efficiently, a keyword list is also constructed by gathering all keywords of the selected representative shots in the current layer. Hence, for the video summary at layer k, the keyword list $\bigcup(KAS_1, KAS_2, \ldots, KAS_N)$ (N is the number of shots in the current summary layer k) is also displayed to supply the viewer with a compact textual description of the video content.

To present and visualize video content for summarization, the representative units for scenes, groups and shots are selected to construct the pictorial summary or skimming. The strategies below are utilized to select representative shots (SelectRepShot) and frames (SelectRepFrame) from groups, and to select representative groups from scenes (SelectRepGroup). Moreover, the scene clustering algorithm (SceneClustering) and the similarity evaluation between scenes (SceneSim) are also proposed.

[SelectRepShot] The representative shot of group $G_i$ is defined as the shot that represents the most content in $G_i$. Given any group $G_i$, this procedure selects its representative shot $RT_i$. In Sect. 5.1, we merged all shots in $G_i$ into $N_c$ clusters; these clusters help us to select the representative shots. Given group $G_i$ with $N_c$ clusters $C_i$, we denote by $ST(C_i)$ the number of shots contained in cluster $C_i$. The selection of the representative shot for $G_i$ is based on the cluster information and the description keywords among shots:
1. Given $N_c$ clusters $C_i$ ($i = 1, \ldots, N_c$) in $G_i$, use the steps below to extract one representative shot for each cluster $C_i$. In all, $N_c$ representative shots are selected for each $G_i$.
2. Given any cluster $C_i$, the shot in $C_i$ which has more keywords and a larger time duration usually contains more semantics. Hence, it is selected as the representative shot. Recall that $KAS_l$ denotes the union of all keywords which have been used in describing shot $S_l$, so $\Psi(KAS_l)$ indicates the number of keywords in shot $S_l$.
• $RT = \arg\max_{S_l} \{\Psi(KAS_l),\ S_l \in C_i\}$. If there is only one shot contained in $RT$, it is selected as the representative shot of $C_i$.
• Otherwise, the shot in $RT$ that has the largest time duration is selected as the representative shot.

[Figure 11: diagram of the four summary layers, from Layer 1 (bottom) to Layer 4 (top)]

Fig. 11. Hierarchical video summarization strategy. (PR, DL, SU and UN represent presentation, dialog, surgery, and unknown events respectively; the solid line indicates that this group is selected as the representative group of the newly generated unit)

Fig. 12. Hierarchical video summary at the first layer (the first row presents the video summary at the current layer; the other rows in the interface show current group information)

3. The collection of the representative shots of all clusters $C_i$ forms the representative shots of $G_i$.

[SelectRepFrame] After we find the representative shot(s) for each group, the key frame of each representative shot is taken as a representative frame of the group.

[SelectRepGroup] The representative group of scene $SE_i$ is defined as the group in $SE_i$ which addresses the most content information of $SE_i$. After group merging or scene clustering, similar groups are merged to form a semantically richer unit (scene), and the representative group of the constructed scene should be selected to represent and visualize its content. Assume $SE_i$ is the newly generated scene, merged from $N_i$ groups ($G_l$, $l = 1, \ldots, N_i$), i.e. $SE_i = G_1 \cup G_2 \cup \ldots \cup G_{N_i}$. Let $KAG_l$ denote the union of keywords which have been used in describing all shots in $G_l$. Then $\Psi(KAG_l)$ indicates the number of keywords in $G_l$, and $\Psi(\bigcup(KAG_1, \ldots, KAG_{N_i}))$ denotes the number of keywords which have been used in describing all shots in $SE_i$. The representative group ($RG_i$) of $SE_i$ is selected using the procedure below:
• $$RG = \arg\max_{G_l} \left\{ \frac{\Psi(KAG_l)}{\Psi(\bigcup(KAG_1, \ldots, KAG_{N_i}))},\ l = 1, \ldots, N_i \right\}$$
That is, the group in $SE_i$ which contains most of the keywords in $SE_i$ is considered as the representative group. If there is only one group contained in $RG$, it is taken as the representative group ($RG_i$) of $SE_i$.
• Otherwise, the group in $RG$ which has the longest time duration is selected as the representative group of $SE_i$.
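The SelectRepShot and SelectRepGroup rules can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the `Shot` and `Group` classes, their field names, and the use of summed shot durations as a group's duration are hypothetical and are not taken from the paper's implementation; the shot clusters are assumed to be already available from Sect. 5.1.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    name: str
    keywords: frozenset  # KAS: keywords used to describe this shot
    duration: float      # time duration, e.g. in seconds


@dataclass
class Group:
    name: str
    clusters: list       # shot clusters (each a list of Shot), assumed given by Sect. 5.1

    @property
    def shots(self):
        return [s for cluster in self.clusters for s in cluster]

    @property
    def keywords(self):
        # KAG: union of the keywords of all shots in the group
        return frozenset().union(*(s.keywords for s in self.shots))

    @property
    def duration(self):
        # Assumption: a group's duration is the sum of its shots' durations
        return sum(s.duration for s in self.shots)


def select_rep_shots(group):
    """One representative shot per cluster: most keywords, ties broken by duration."""
    reps = []
    for cluster in group.clusters:
        best = max(len(s.keywords) for s in cluster)
        tied = [s for s in cluster if len(s.keywords) == best]
        reps.append(max(tied, key=lambda s: s.duration))
    return reps


def select_rep_group(scene_groups):
    """Group covering the largest share of the scene's keywords, ties broken by duration."""
    scene_keywords = frozenset().union(*(g.keywords for g in scene_groups))
    total = len(scene_keywords)

    def ratio(g):
        return len(g.keywords) / total if total else 0.0

    best = max(ratio(g) for g in scene_groups)
    tied = [g for g in scene_groups if ratio(g) == best]
    return max(tied, key=lambda g: g.duration)
```

With two shots that carry the same number of keywords, the longer one is returned for that cluster, which mirrors the "Otherwise" branch of both procedures.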

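The sequential event-category selection used for the video level (layer 4) summary in Sect. 6 can be illustrated with the short Python sketch below. It rests on several assumptions that are not confirmed by the paper: the video is split by group count rather than by time, each representative group is given as a hypothetical `(name, event_keywords)` pair in temporal order, and a group is accepted when all of its event keywords are still present in the remaining union. Under that subset test a single pass is equivalent to repeating the check, because the remaining set only shrinks.

```python
def split_three(items):
    """Split a temporally ordered list into front, middle and back parts of nearly equal size."""
    n = len(items)
    a, b = n // 3, (2 * n) // 3
    return items[:a], items[a:b], items[b:]


def select_event_groups(rep_groups):
    """Pick one representative group per event type.

    rep_groups: list of (name, event_keywords) pairs in temporal order
    (hypothetical input format). Returns the names of the selected groups.
    """
    # Union of the event keywords of all representative groups in this part
    remaining = set().union(*(set(kw) for _, kw in rep_groups))
    selected = []
    for name, kw in rep_groups:
        if not remaining:
            break
        kw = set(kw)
        # Assumed acceptance test: every event keyword of the group is still unclaimed
        if kw and kw <= remaining:
            selected.append(name)
            remaining -= kw
    return selected


def video_level_candidates(rep_groups):
    """Apply the selection independently to the front, middle and back parts."""
    return [select_event_groups(part) for part in split_three(rep_groups)]
```

For example, if a part contains groups annotated presentation, dialog, presentation, surgery in that order, the second presentation group is skipped and the other three are kept.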


[SceneClustering] As shown in Fig. 11, this procedure constructs a hierarchy beyond the scene level summary. After scene detection, the newly generated scenes may consist of several groups. In this case, SceneSim() (introduced below) is used to calculate the similarity between scenes. The procedure of scene clustering is given below:
1. Since a scene with too many shots may result in an ambiguous event category and, in addition, has a higher probability of absorbing other groups, any scene containing more than 20% of the shots in the video is no longer used for clustering.
2. Use a typical clustering algorithm, ISODATA, to merge similar scenes into classes. While clustering, the scene similarity is calculated using steps 3 and 4 below.
3. Set α = 0.3. Given any two scenes $SE_i$ and $SE_j$, assume there are $N_i$ and $N_j$ groups ($G_{i1}, G_{i2}, \ldots, G_{iN_i}$; $G_{j1}, G_{j2}, \ldots, G_{jN_j}$) contained in $SE_i$ and $SE_j$ respectively; then $\bigcup(GDE_{i1}, \ldots, GDE_{iN_i})$ denotes the keywords of event descriptors which have been used in all groups of $SE_i$. If $\bigcap(\bigcup(GDE_{i1}, \ldots, GDE_{iN_i}), \bigcup(GDE_{j1}, \ldots, GDE_{jN_j})) = \emptyset$, the similarity between them is set to 0. Otherwise, go to step 4. That is, scenes with mutually exclusive keywords of event descriptors cannot be merged into one class.
4. Use SceneSim() to calculate the overall similarity between scenes $SE_i$ and $SE_j$.
5. Return the clustered scene structure.

[SceneSim] Video scenes consist of groups. Hence, given scenes $SE_i$ and $SE_j$, assume there are $N_i$ and $N_j$ groups ($G_{i1}, G_{i2}, \ldots, G_{iN_i}$; $G_{j1}, G_{j2}, \ldots, G_{jN_j}$) contained in $SE_i$ and $SE_j$, respectively. We define $K(SE_i)$ to be all groups in $SE_i$, and $N(SE_i)$ to be the number of groups in $SE_i$; that is, $K(SE_i) = \{G_{i1}, G_{i2}, \ldots, G_{iN_i}\}$ and $N(SE_i) = N_i$. Comparing $SE_i$ and $SE_j$, the scene containing fewer groups is selected as the benchmark scene. We denote the benchmark scene as $\hat{B}_{i,j}$, and the other scene as $\tilde{B}_{i,j}$. Then, the similarity between $SE_i$ and $SE_j$ is calculated using Eq. 17. That is, the similarity between any two scenes is the average maximal similarity between the groups in the benchmark scene and their most similar groups in the other scene.

$$SceneSim(SE_i, SE_j) = \frac{1}{N(\hat{B}_{i,j})} \sum_{l=1,\ G_l \in K(\hat{B}_{i,j})}^{N(\hat{B}_{i,j})} \max\{GroupSim(G_l, G_k);\ G_k \in K(\tilde{B}_{i,j}),\ k = 1, 2, \ldots, N(\tilde{B}_{i,j})\} \qquad (17)$$
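As an illustration of the scene similarity described above, the following Python sketch computes the average, over the groups of the smaller (benchmark) scene, of the best match in the other scene, and applies the event-keyword gate from step 3 of SceneClustering. The names `scene_sim`, `gated_scene_sim`, the `events` mapping, and the pluggable `group_sim` callable are assumptions made for this sketch; GroupSim itself is defined elsewhere in the paper and is not reproduced here, and the ISODATA clustering step is omitted.

```python
def scene_sim(scene_i, scene_j, group_sim):
    """Average maximal group similarity, taking the scene with fewer groups as benchmark.

    scene_i, scene_j: lists of group identifiers.
    group_sim: callable (g, h) -> float, standing in for GroupSim.
    """
    if len(scene_i) <= len(scene_j):
        benchmark, other = scene_i, scene_j
    else:
        benchmark, other = scene_j, scene_i
    if not benchmark:
        return 0.0
    best_matches = [max(group_sim(g, h) for h in other) for g in benchmark]
    return sum(best_matches) / len(benchmark)


def gated_scene_sim(scene_i, scene_j, events, group_sim):
    """Return 0 when the two scenes share no event keyword, else scene_sim.

    events: dict mapping a group identifier to its set of event keywords (GDE).
    """
    events_i = set().union(*(events[g] for g in scene_i))
    events_j = set().union(*(events[g] for g in scene_j))
    if not (events_i & events_j):
        return 0.0
    return scene_sim(scene_i, scene_j, group_sim)
```

Because the benchmark is chosen by size, the result does not depend on the order in which the two scenes are passed, provided `group_sim` is symmetric.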

7. Experimental results, analysis and applications

Two types of experimental results, group detection and hierarchical summarization, and some potential applications of the proposed strategies are presented in this section. About eight hours of medical videos and four hours of news programs are used as our test bed (all the video data are MPEG-I encoded, with the digitization rate equal to 30 frames/s). Videos are first parsed with the shot segmentation algorithm to detect the gradual and break changes. The gradual change frames between shots are removed during shot segmentation.

7.1. Group detection results

The group detection experiment is executed on four medical videos and four news programs. Experimental comparisons are made with [40, 45] (from [40], we only use the group detection strategy). Moreover, to judge the quality of the detected results, the following rule is applied: a group is judged to be correctly detected if and only if all shots in the group belong to the same scene (semantic unit); otherwise the group is judged to be falsely detected. Thus, the group detection precision (P) in Eq. 24 is used for performance evaluation.

$$P = \frac{\text{Number of correctly detected groups}}{\text{Number of detected groups}} \qquad (24)$$

Clearly, by treating each shot as one group, the group detection precision would be 100%. Hence, a compression rate factor (CRF) is also defined in Eq. 25:

$$CRF = \frac{\text{Detected group number}}{\text{Total shot number in the video}} \qquad (25)$$

The experimental results and comparisons of group detection strategies are given in Table 5. To identify the methods used in the experiment, we denote our method as A, and the two methods in [40, 45] as B and C respectively. From the results in Table 5, some observations can be made:
• Our video grouping method achieves the best precision among all methods; about 87% of shots are assigned to the right groups. Hence, the annotator will not be required to separate many groups which contain multiple scenes.
• Comparing all three methods, since method C is proposed for scene detection, it achieves the highest compression rate. However, the precision of this method is also the lowest. Moreover, since this strategy is threshold based, some groups are inevitably over-segmented or missed.
• As a trade-off with precision, the compression ratio of our method is the worst (28.9%); that is, on average, each group consists of approximately 3.5 shots. However, when supplying video groups for content annotation, it is often worse to fail to segment distinct boundaries than to over-segment a scene. In addition, other strategies in the paper, such as group merging and hierarchical summarization, will further improve the compression ratio. From this point of view, our system achieves relatively better performance.



Table 5. Video group detection results

                      Method A                    Method B                    Method C
Movie name   Shots    Detected groups  P     CRF      Detected groups  P     CRF      Detected groups  P     CRF
Medical 1    265      62               0.89  0.23     34               0.78  0.13     26               0.66  0.098
Medical 2    221      70               0.84  0.32     38               0.67  0.17     18               0.57  0.081
Medical 3    388      121              0.82  0.31     61               0.64  0.16     39               0.72  0.101
Medical 4    244      72               0.87  0.30     38               0.76  0.15     27               0.48  0.111
News 1       189      58               0.91  0.31     22               0.81  0.12     14               0.64  0.074
News 2       178      46               0.85  0.26     26               0.72  0.15     18               0.71  0.101
News 3       214      57               0.91  0.27     24               0.76  0.11     23               0.62  0.107
News 4       190      59               0.90  0.31     27               0.80  0.14     19               0.67  0.100
Average      1889     545              0.87  0.289    270              0.74  0.143    184              0.64  0.097
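The compression rate factor of Eq. 25 and the "approximately 3.5 shots per group" figure quoted for method A can be checked directly from the counts in Table 5. The short Python sketch below does this; the dictionary layout and function names are illustrative choices, and the precision function is shown only for completeness since the table reports P but not the underlying counts of correctly detected groups.

```python
# (shots, detected groups) for method A, copied from Table 5
TABLE5_METHOD_A = {
    "Medical 1": (265, 62),
    "Medical 2": (221, 70),
    "Medical 3": (388, 121),
    "Medical 4": (244, 72),
    "News 1": (189, 58),
    "News 2": (178, 46),
    "News 3": (214, 57),
    "News 4": (190, 59),
}


def precision(correct_groups, detected_groups):
    """Eq. 24: fraction of detected groups whose shots all belong to one scene."""
    return correct_groups / detected_groups


def crf(detected_groups, total_shots):
    """Eq. 25: detected group number divided by total shot number."""
    return detected_groups / total_shots


total_shots = sum(shots for shots, _ in TABLE5_METHOD_A.values())
total_groups = sum(groups for _, groups in TABLE5_METHOD_A.values())
overall_crf = crf(total_groups, total_shots)
shots_per_group = total_shots / total_groups

print(total_shots, total_groups)      # 1889 545
print(round(overall_crf, 3))          # 0.289
print(round(shots_per_group, 1))      # 3.5
```

The per-video CRF values in the table follow the same way, for example 62 / 265 for Medical 1.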

7.2. Hierarchical video summarization results

As stated before, a four layer video summary is produced for each video. Three questions are introduced to evaluate the quality of the summary at each layer: (1) How well do you think the summary addresses the main topic of the video? (2) How well do you think the summary covers the scenario of the video? (3) Is the summary concise? For each of the questions, a score from 0 to 5 (where 5 indicates best) is specified by five student viewers after viewing the video summary at each level. Before the evaluation, viewers are asked to browse the entire video to get an overview of the video content. An average score for each level is computed from the students’ scores (shown in Fig. 13). A second evaluation process uses the ratio between the number of representative frames at each layer and the number of all key frames to indicate the compression rate (RC) of the video summary. To normalize this value with the scores of the questions, we multiply RC by 5 and use this value in Fig. 13. The influence of the factor α on the quality of the summary is also addressed by changing the value of α and evaluating the generated summaries. The results are shown in Figs. 13–15.

From Figs. 13–15, we can see that as we move to lower levels, the ability of the summary to cover the main topic and the scenario of the video is greater. The conciseness of the summary is the worst at the lowest level, since as the level decreases, more redundant groups are shown in the summary. At the highest level, the video summary cannot describe the video scenarios, but can supply the user with a concise summary and relatively clear topic information. Hence, this level can be used to show differences between videos in the database. It was also found that the third level acquires relatively optimal scores for all three questions. Thus, this layer is the most suitable for giving users an overview of a video selected from the database for the first time.

Comparing Figs. 13–15, the influence of the factor α on the video summary can be clearly evaluated. When α changes from 0 to 0.5, the summary at each layer has a larger and larger compression ratio (RC). A more concise video summary is acquired for each layer, since with the increase in significance of semantics in the similarity evaluation, all groups with the same event categories tend to be grouped into one group, and the visual similarity among those groups tends to be neglected. At higher levels, the summary is more concise; however, the scenario of the video summary is worse. Based on these results, we set α = 0.3 in most other experiments.

To evaluate the efficiency of our hierarchical video summarization more objectively, a retrieval based evaluation method is presented. In this experiment, all 16 videos in the database are first annotated with our semi-automatic annotation strategy, and then each video is manually separated into three clips (the whole video database contains 16 × 3 = 48 clips). No shot or group overlaps the manually segmented boundaries; that is, each segmentation boundary is also a shot and group boundary. We then use hierarchical summarization to construct the summary for each clip. We randomly select one clip from the database, and use its summary from different layers as the query to retrieve from the database. The ranks of the other two clips which are in the same video as the query are counted to evaluate the system efficiency. The similarity between the query summary and the clips in the video database is evaluated using the strategy below:
• For any given query summary at a specified level i, collect all its representative groups at this level, and denote them as $QG_i = \{RG_{i1}, \ldots, RG_{iN}\}$, where N indicates the number of representative groups.
• For clip $CP_j$ in the database, gather all its representative groups at each level k, and denote the result as $DG_j^k = \{RG_{j1}, \ldots, RG_{jM}\}$, where M indicates the number of representative groups.
• Use Eq. 26 to calculate the similarity between the query summary and the summary at layer k in $CP_j$.
• The similarity between the query summary and $CP_j$ is evaluated with Eq. 27, and its rank is used for system performance evaluation.

$$SumtoSumSim(QG_i, DG_j^k) = \begin{cases} \dfrac{1}{N} \sum_{l=1}^{N} \min\{GroupSim(RG_{il}, RG_{ja});\ RG_{il} \in QG_i,\ RG_{ja} \in DG_j^k,\ a = 1, \ldots, M\} & \text{if } N \le M \\ \dfrac{1}{M} \sum_{l=1}^{M} \min\{GroupSim(RG_{jl}, RG_{ia});\ RG_{jl} \in DG_j^k,\ RG_{ia} \in QG_i,\ a = 1, \ldots, N\} & \text{otherwise} \end{cases} \qquad (26)$$


[Plot: average scores (0 to 5) for Ques.1, Ques.2, Ques.3 and 5*RC versus hierarchical summary layer (1 to 4)]

Fig. 13. Hierarchical video summary evaluation (α = 0.3)

[Plot: average scores (0 to 5) for Ques.1, Ques.2, Ques.3 and 5*RC versus hierarchical summary layer (1 to 4)]

Fig. 14. Hierarchical video summary evaluation (α = 0.0)

$$SumtoViSim(QG_i, CP_j) = \min_{k}\{SumtoSumSim(QG_i, DG_j^k)\} \qquad (27)$$
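The summary-to-summary and summary-to-clip measures of Eqs. 26 and 27 can be sketched as follows. The sketch follows the equations as printed, including the use of the minimum in both places; the function names, the dictionary of per-layer summaries, and the pluggable `group_sim` callable (standing in for GroupSim) are assumptions introduced here for illustration.

```python
def sum_to_sum_sim(query_groups, db_groups, group_sim):
    """Eq. 26 as printed: iterate over the smaller set, take the minimum
    GroupSim against the larger set, and average."""
    if not query_groups or not db_groups:
        return 0.0
    if len(query_groups) <= len(db_groups):
        outer, inner = query_groups, db_groups
    else:
        outer, inner = db_groups, query_groups
    return sum(min(group_sim(g, h) for h in inner) for g in outer) / len(outer)


def sum_to_video_sim(query_groups, clip_layers, group_sim, layers=(4, 3, 2)):
    """Eq. 27 as printed: minimum of Eq. 26 over the clip's summary layers.

    clip_layers: dict mapping a layer number k to the list of representative
    groups of the clip at that layer (hypothetical layout).
    """
    return min(sum_to_sum_sim(query_groups, clip_layers[k], group_sim) for k in layers)
```

Restricting `layers` to the three highest levels matches the setting described in the text, where lower layers are skipped to reduce retrieval time.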




We randomly select 16 retrieval results from the medical videos and news programs, and show them in Table 6. To reduce retrieval time, we select only the summary at level 3 as the query, and only summaries of the highest three levels (k = 4, 3, 2) in the database are used for comparison. From Table 6, we see that our retrieval strategy achieves reasonably good results: the average location of retrieved clips that are in the same video as the query is 3.187 (out of 47 clips, since the query clip is excluded from the database). Nevertheless, we notice that the retrieval results for news are worse than for medical videos, because news programs usually differ from general video data. In common videos, a similar scene may be shown repeatedly, but in news programs, most story units are only reported once. Hence, the summary at the front part of the video may be quite different from the summary at the middle or back part.

In addition, to address the influence of the layer of the query summary on the retrieval efficiency, the summaries at different layers (k = 4, 3, 2) are used as queries to retrieve from the database. To evaluate the influence of semantics in video retrieval, we set α equal to 0.3 and 0.0 respectively. The results are shown in Fig. 16. It can be seen that as the layer goes higher, the retrieval accuracy becomes worse. With α = 0.3, even at the highest level, the average location of the correctly retrieved clips is 5.26, which is still much better than the result (6.788) of retrieval at level 2 with α = 0.0. Thus, by considering structured video semantic information, the video retrieval results are improved substantially. We do the retrieval in only three layers (k = 4, 3, 2), since more time is needed to calculate the similarity at lower layers.

The system is implemented in C++ with an MPEG-I decoder that has been developed by our group. Since we need to generate the video skimming at each level, MPEG-I editing tools have also been developed to assemble several video clips into one MPEG-I stream with an integrated audio signal.

7.3. Potential application domains

With the proposed strategies, video content description and summarization can be acquired in an improved way; some potential applications may also be implemented within the domains below:

[Plot: average scores (0 to 5) for Ques.1, Ques.2, Ques.3 and 5*RC versus hierarchical summary layer (1 to 4)]

Fig. 15. Hierarchical video summary evaluation (α = 0.5)

[Plot: average return location versus query summary layer, for α = 0.3 and α = 0.0; surviving data labels: 8.024, 6.788, 5.263, 2.921, 3.187]

Fig. 16. Video retrieval results with summaries at different levels

1. Comprehensive content annotation for special videos or home entertainment videos
As we mentioned in Sect. 2, it is very hard, if not impossible, to annotate video content accurately and automatically, especially for special videos, e.g. medical videos, where users are usually interested in regions (or semantic objects) which cannot be annotated automatically. On the other hand, with the wide spread of home entertainment video equipment, an efficient annotation and summarization strategy is also urgently needed for video management. Hence, the proposed strategies provide a practical solution to acquire content descriptions for those video data.
2. Improved content-based video retrieval efficiency
Historically, content-based video retrieval systems have employed either textual word based or visual feature based retrieval. Obviously, by integrating video content annotation and



Table 6. Video retrieval results with summaries (query with summaries at layer 3, α = 0.3)
Videos: Medical videos; News program; Average
Query 1: 2, 2, 1, 4
Query 2: 3, 5, 2, 3
Query 3: 3, 2, 1, 6
Query 4: 4, 4, 2, 3
Query 5: 1, 7, 2, 2
Query 6: 1, 4, 3, 2
Query 7: 3, 7, 6, 5
Query 8: 1, 4, 3, 3


visual features, the retrieval performance could be improved remarkably [49]. Moreover, query expansion [51] technology could also be integrated to enhance the performance of the video database system.
3. Improved remote video database access
The network condition for video transmission is always changing and the transmitted video bit rate is also variable; thus it is very important to support adaptive content delivery and quality of service (QoS) control for online video database systems. By utilizing video content description and hierarchical summarization, video streaming, adaptive transmission and QoS control could be implemented directly by considering the video content scale and network bandwidth for effective remote video database access.
4. Comprehensive video browsing
Browsing has the advantage of keeping the user in the loop during the search process. However, current video retrieval systems do not support efficient browsing because they lack an efficient summary organization structure. With the acquired video annotation and hierarchical summaries, both hierarchical browsing and category browsing are easily supported by presenting multilevel summaries and the annotated video database structure.

Obviously, the proposed strategies provide solutions from video content annotation to summarization. We believe that by integrating those schemes, more potential applications might be implemented in other multimedia systems.

8. Conclusions

In this paper, we have addressed the problem of video content description and summarization for general videos. Due to the unsatisfactory results of video processing techniques in automatically acquiring video content, annotations are still widely used in many applications. Hence, strategies to describe and acquire video content accurately and efficiently must be addressed. We have proposed a content description ontology and a data structure to describe the video content at different levels and with different granularities. A semi-automatic annotation scheme with relevance feedback is implemented by utilizing video group detection, joint semantics and visual features for scene detection, etc. to substantially improve annotation efficiency.

Based on the acquired content description data, a hierarchical video summarization scheme has been presented. Unlike other summarization strategies which select important low-level feature related units to build video summaries (since the semantics are not available to them), our method uses the acquired semantic information and visual features among video data to construct a hierarchical structure that describes the video content at various levels. Our proposed strategy considers the content hierarchy, video content description data, and redundancy among videos. With this scheme, the video content can be expressed progressively, from top to bottom in increasing levels of granularity.

Video summaries that only take into account the low-level features of the audio, video or closed-captioned tracks put a great deal of emphasis on the details but not on the content. Video content or structure analysis is necessary prior to video summarization, because the most useful summary may not be just a collection of the most interesting visual information. Hence, our hierarchical summarization strategy achieves a more reasonable result. With this scheme, the video data can be parsed into a hierarchical structure, with each node containing the overview of the video at the current level. As a result, the structure can be widely used for content management, indexing, hierarchical browsing, or other applications. We are currently exploring the possibility of using sequential pattern mining techniques from data mining to automate and enhance video grouping for hierarchical video content description. Since video summarization techniques can be used to construct a summary at any layer, we believe that the hierarchical architecture proposed in this paper can be generalized as a toolkit for video content management and presentation.
Acknowledgements. The authors would like to thank the anonymous reviewers for their valuable comments. We would also like to thank Ann C. Catlin for her help in preparing the manuscript. This research has been supported by the NSF under grants 0209120-IIS, 0208539-IIS, 0093116-IIS, 9972883-EIA, and by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number DAAD19-02-1-0178.



Appendix

We provide the following table of notations used for easy reference by the reader.

[symbol lost] : a set of video streams.
$S_i$ : the ith shot in the video.
$G_i$ : the ith group in the video.
$SE_i$ : the ith scene in the video.
$CP_i$ : the ith clip in the video.
VD : Video Description.
GD : Group Description.
SD : Shot Description.
FD : Frame Description.
KA : keyword aggregation of video ontology.
$KAS_i$ : the union of all keywords which have been used in describing shot $S_i$.
$KAG_j$ : the union of all keywords which have been used in describing group $G_j$.
TDD : temporal description data; each shot has one TDD.
TDS : temporal description stream; each video has one TDS.
$v^{ID}_{a-b}$ : the region from frame a to frame b in the video with a certain ID.
Map(KA, V) : correspondence between annotation (KA) and the video temporal information (V).
[symbol lost] : aggregation of keywords which have been used to annotate shot $S_k$ in VD.
[symbol lost] : aggregation of keywords which have been used to annotate shot $S_k$ in GD.
[symbol lost] : aggregation of keywords which have been used to annotate shot $S_k$ in SD.
[symbol lost] : aggregation of keywords which have been used to annotate shot $S_k$ in FD.
$\bigcup(X_1, X_2, \ldots, X_N)$ : the union of $X_1, X_2, \ldots, X_N$, i.e. $\{X_1 \cup X_2 \cup \ldots \cup X_N\}$.
$\bigcap(X_1, X_2, \ldots, X_N)$ : the intersection of $X_1, X_2, \ldots, X_N$, i.e. $\{X_1 \cap X_2 \cap \ldots \cap X_N\}$.
$\Psi(X)$ : the number of keywords in X.
$\Theta(X, Y)$ : the number of overlapped frames between X and Y.
$EF(KA_k, S_i)$ : normalized factor with which keyword $KA_k$ takes effect in shot $S_i$.
$GDE_i$ : the aggregation of the event descriptor’s keywords which have been used in the GD of $G_i$.
$RT_i$ : representative shot of group $G_i$.
$RG_i$ : representative group of scene $SE_i$.
$StSim(S_i, S_j)$ : visual feature similarity between shots $S_i$ and $S_j$.
$SemStSim(S_i, S_j)$ : semantic similarity between shots $S_i$ and $S_j$.
$ShotSim(S_i, S_j)$ : unified similarity between $S_i$ and $S_j$ which integrates visual features and semantics.
$StGpSim(S_i, G_j)$ : similarity between shot $S_i$ and group $G_j$.
$GroupSim(G_i, G_j)$ : similarity between groups $G_i$ and $G_j$.
$SceneSim(SE_i, SE_j)$ : similarity between scenes $SE_i$ and $SE_j$.


22. Satoh S, Sato T, Smith M, Nakamura Y, Kanade T. Name-it: Naming and detecting faces in News video. Network-Centric Computing special issue
23. Girgensohn A, Foote J (1999) Video classification using transform coefficients. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Phoenix, AZ 6:3045–3048
24. Dimitrova N, Elenbaas H, McGee T, Leyvi E, Agnihotri L (2000) An architecture for video content filtering in consumer domain. International Conference on Information Technology: Coding and Computing (ITCC’00)
25. Zhou W, Vellaikal A, Kuo CCJ (2001) Rule-based video classification system for basketball video indexing. Proceedings of ACM International Conference on Multimedia, Los Angeles, CA
26. Haering N, Qian R, Sezan M. Detecting hunts in wildlife videos. Proceedings of the IEEE International Conference on Multimedia Computing and Systems, Volume I
27. Aguierre Smith T, Davenport G (1992) The Stratification System: A design environment for random access video. Third International Workshop on Network and Operating System Support for Digital Audio and Video, pp 250–261
28. Weiss R, Duda A, Gifford D (1994) Content-based access to algebraic video. IEEE International Conference on Multimedia Computing and Systems, Boston, MA, pp 140–151
29. Davenport G, Murtaugh M (1995) Context: towards the evolving documentary. Proceedings of ACM Multimedia Conference, San Francisco, CA
30. Davis M (1993) Media streams: An iconic visual language for video annotation. IEEE Symposium on Visual Language, pp 196–202
31. Petkovic M, Jonker W (2000) An overview of data models and query languages for content-based video retrieval. International Conference on Advances in Infrastructure for Electronic Business, Science, and Education on the Internet, Italy
32. Kender J, Yeo B (1998) Video scene segmentation via continuous video coherence. Proceedings of CVPR
33. Jiang H, Montesi D, Elmagarmid AK (1997) Video text database systems. Proceedings of IEEE Multimedia Systems, Ottawa, Canada
34. Ponceleon D, Dieberger A (2001) Hierarchical brushing in a collection of video data. Proceedings of the 34th Hawaii International Conference on System Sciences
35. Luke S, Spector L, Rager D (1996) Ontology-based knowledge discovery on the world-wide web. Proceedings of the Workshop on Internet-based Information Systems, AAAI-96, Portland, OR
36. Miller G (1995) Wordnet: A lexical database for English. Commun ACM 38(11)
37. Mena E, Kashyap V, Illarramendi A, Sheth A (1998) Domain specific ontologies for semantic information brokering on the global information infrastructure. Proceedings of FOIS’98
38. Bloch G (1988) From concepts to film sequences. Proceedings of RIAO, Cambridge, MA, pp 760–767
39. Parkes A (1992) Computer-controlled video for intelligent interactive use: a description methodology. In: ADN Edwards, S Holland (eds) Multimedia Interface Design in Education. New York
40. Rui Y, Huang T, Mehrotra S (1999) Constructing table-of-content for video. ACM Multimedia Syst J 7(5):359–368

1. Zhang H, Kantankanhalli A, Smoliar S (1993) Automatic partitioning of full-motion video. ACM Multimedia Syst 1(1) CE 1 2. Zhang H, Low CY, Smoliar SW, Zhong D (1995) Video parsing, retrieval and browsing: an integrated and content-based solution. Proceedings ACM Conference on Multimedia, CE 2 3. Yeung M, Yeo B (197) Video visualization for compact presentation and fast browsing of pictorial content. IEEE Trans CSVT 7:771–785 4. Pfeiffer S, Lienhart R, Fischer S, Effelsberg W (1996) Abstracting digital movies automatically. VCIP 7(4):345–353 5. Doulamis N, Doulamis A, Avrithis Y, Ntalianis K, Kollias S (2000) Ef?cient summarization of stereoscopic video sequences. IEEE Trans CSVT 10(4) CE 3 6. DeMenthon D, Kobla V, Doermann D (1998) Video summarization by curve simpli?cation. Proceedings ACM Conference on Multimedia. Bristol, UK, pp 13–16 7. Uchihashi S, Foote J, Girgensohn A, Boreczky J (1999) Video Managa: generating semantically meaningful video summaries. Proceedings ACM Conference on Multimedia,Orlando, FL pp 383–392 8. He L, Sanocki W, Gupta A, Grudin J (1999) Auto-summarization of audio-video presentations. Proceedings of ACM Conference on Multimedia. Orlando, FL pp 489–498 9. Ratakonda K, Sezan M, Crinon R (1999) Hierarchical video summarization. IS&T/SPIE Conference on Visual Communications and Image Processing’99, San Jose, CA 3653:1531-1541 10. Kim C, Hwang J (2000) An integrated scheme for object-based video abstraction. Proceedings of ACM Conference on Multimedia Los Angeles, CA pp 303–311 11. Nam J, Tew?k A (1999) Dynamic video summarization and visualization. Proceedings of ACM International Conference on Multimedia, Orlando, FL 12. Lienhart R (1999) Abstracting home video automatically. Proceedings ACM Multimedia Conference, CE pp 4 37–40 13. Christel M, Hauptmann A, Warmack A, Crosby S (1999) Adjustable ?lmstrips and skims as abstractions for a digital video library. IEEE Advances in Digital Libraries Conference, MD, USA 14. 
Lienhart R, Pfeiffer S, Wffelsberg W (1997) Video abstracting. Commun ACM 40(12) 15. Christel M (1999) Visual digest for news video libraries. Pro5 ceedings ACM Multimedia Conference, CE , FL 16. Nack F, Windhouwer M, Hardman L, Pauwels E, Huijberts M (2001) The role of highlevel and lowlevel features in style-based retrieval and generation of multimedia presentation. New Review of Mypermedia and Multimedia (NRHM) 2001 17. Venkatesh S, Dorai C (2001) Bridging the semantic gap in content management systems: Computational medial aestetics. Proceedings of COSIGN, Amsterdam, pp 94–99 18. Windhouwer M, Schmidt R, Kersten M (1999) Acoi: A system for indexing multimedia objects. International Workshop on Information Integration and Web-based Applications & Services, Indonesia 19. Smoliar S, Zhang H (1994) Content based video indexing and retrieval. IEEE Multimedia 1(2):62–72 20. Brunelli R, Mich O, Modena C (1996) A survey on video indexing. IRST-technical report 21. Zhong D, Zhang H, Chang S (1997) Clustering methods for video browsing and annotation. Technical report, Columbia University, NJ

X. Zhu et al.: Hierarchical video content description and summarization using uni?ed semantic and visual similarity 41. Aref W, Elmagarmid A, Fan J, Guo J, Hammad M, Ilyas I, Marzouk M, Prabhakar S, Rezgui A, Teoh A, Terzi E, Tu Y, Vakali A, Zhu X (2002) A distributed database server for continuous media. Proceedings of IEEE 18th ICDE demonstration, San Jose, CA 42. Yeo B, Liu B (1995) Rapid scene analysis on compressed video. IEEE Trans CSVT 5(6) 43. Fan J, Aref W, Elmagarmid A, Hacid M, Marzouk M, Zhu X (2001) MultiView: Multilevel video content representation and retrieval. J Electr Imag 10(4):895–908 44. Yeung M,Yeo B (1996) Time-constrained clustering for segmentation of video into story units. Proceedings of ICPR’96 CE 15 45. Lin T, Zhang H (2000) Automatic video scene extraction by shot grouping. Proceedings of ICPR 2000, CE 16 46. Fan J, Yu J, Fujita G, Onoye T, Wu L, Shirakawa I (2001) Spatiotemporal segmentation for compact video representation. Signal Process: Image Commun 16:553–566 47. Zhu X, Liu W, Zhang H, Wu L (2001) An image retrieval and semi-automatic annotation scheme for large image databases on the Web. IS&T/ 4311:168–177 48. Lu Y, Hu C, Zhu X, Zhang H, Yang Q (2000) A uni?ed semantics and feature based image retrieval technique using relevance 17 feedback. Proceedings of ACM Multimedia Conference, CE CA, pp 31–37 49. Zhu X, Zhang H, Liu W, Hu C, Wu L (2001) A new query re?nement and semantics integrated image retrieval system with semi-automatic annotation scheme. J Electr Imag – Special Issue on Storage, Processing and Retrieval of Digital Media 10 (4):850–860 50. Vasconcelos N, Lippman A (1998) A Bayesian framework for content-based indexing and retrieval. Proceedings of DCC’98, Snowbird, UT CE 18 51. Fan J, Zhu X, Wu L (2001) Automatic model-based semantic object extraction algorithm. IEEE Trans Circuits and Systems for Video Technology 11(10)10: 1073–1084 52. 
Zhu X, Fan J, Elmagarmid A, Aref W (2002) Hierarchical video summarization for medical data. SPIE: Storage and Retrieval for Media Databases 4676:395–406 53. Hjelsvold R, Midtstraum R (1994) Modelling and querying video data. Proceedings of VLDB Conference CE 19 54. Schreiber ATh, Dubbeldam B, Wielemaker J, Wielinga B (2001) Ontology-based photo annotation. IEEE Intell Syst 16(3):66–74 55. Gruber T (1992) Ontolingua: A mechanism to support portable ontologies. Technical Report KSL-91-66, Knowledge Systems Laboratory, Stanford University, Palo Alto, CA 56. Voorhees E (1994) Query expansion using lexical-semantic relations. Proceedings 17th International Conference on Research and Development in Information Retrieval (ACM SIGIR), CE 20 pp 61–69



