Indexing of media content on a network
6282549

Abstract
A method and apparatus for searching for multimedia files in a distributed database and for displaying results of the search based on the context and content of the multimedia files.

Claims
What is claimed is:

Description
BACKGROUND OF THE INVENTION
ITEM                                                              WEIGHTING FACTOR
URL of the media file                                             10
Keywords embedded in the media file                               10
Textual annotations in the media file                             10
Script dialogue, lyrics, and closed captioning in the media file  10
Text strings associated with the media file anchor reference      9
Text surrounding the media file reference                         7
Title of the HTML document containing the media file              6
Keywords and meta-tags associated with the HTML document          6
URL for the HTML document containing the media file reference     5
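As an illustration only, the weighting factors in the table above could be applied as in the following Python sketch. The field names, the substring match, and the choice to sum the weights of matching fields are all assumptions made for this example; the text does not state how the factors are combined.

```python
# Weighting factors copied from the table above.
# The dictionary keys are hypothetical field names chosen for this sketch.
WEIGHTS = {
    "media_url": 10,
    "embedded_keywords": 10,
    "annotations": 10,
    "dialogue_lyrics_captions": 10,
    "anchor_text": 9,
    "surrounding_text": 7,
    "html_title": 6,
    "html_meta": 6,
    "html_url": 5,
}


def relevance_score(query_term, fields):
    """Sum the weights of every field whose text contains the query term.

    Assumption: a case-insensitive substring match and a plain sum;
    the source does not specify the matching or combination rule.
    """
    term = query_term.lower()
    return sum(
        weight
        for name, weight in WEIGHTS.items()
        if term in fields.get(name, "").lower()
    )


example = {
    "media_url": "http://example.com/bond_trailer.mov",
    "html_title": "James Bond trailers",
    "surrounding_text": "Watch the new trailer",
}
print(relevance_score("bond", example))
```

In this sketch a term found in the media URL counts for more than the same term found only in the surrounding text, which mirrors the ordering of the table.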
In other embodiments, alternative weighting factors may be utilized without departure from the present invention.

Store Data for each Media Object

Finally, data is stored for each media object. In the described embodiment, the following data is stored:
Relevant text
HTML document title
HTML meta tags
Media specific text (e.g., closed captioning, annotations, etc.)
Media URL
Anchor text
Content previews (discussed below)
Content attributes (such as brightness, color or B/W, contrast, speech v. music and volume level; in addition, sampling rate, frame rate, number of tracks, data rate and size may be stored)
Of course, in alternative embodiments a subset or superset of these fields may be used.

Content Analysis

As was briefly mentioned above, it is desirable to not only search the lexical content surrounding a media file, but also to search the content of the media file itself in order to provide a more meaningful database of information to search. As was shown in FIG. 1, the present invention is generally concerned with indexing two types of media files: (i) audio 102 and (ii) video 103.

Video Content

The present invention discloses an algorithm used to predict the likelihood that a given video file contains a low, medium or high degree of motion. In the described embodiment, the likelihood is computed as a single scalar value, which maps into one of N buckets of classification. The value associated with the motion likelihood is called the "motion" metric. A method for determining and classifying the brightness, contrast and color of the same video signal is also described. The combination of the motion metric along with brightness, contrast and color estimates enhances the ability of users to locate a specific piece of digital video.
Once a motion estimate and brightness, contrast and color estimate exist for all video files located in an index of multimedia content, it is possible for users to execute search queries such as:
"find me all action packed videos"
"find me all dramas and talk shows"

If the digital video information is indexed in a database together with auxiliary text-based information, then it is possible to execute queries such as:
"find me all action packed videos of James Bond from 1967"
"find me all talk shows with Bill Clinton and Larry King from 1993"

Combining motion with other associated video file parameters, users can execute queries such as:
"find me all slow moving, black and white movies made by Martin Scorsese"
"find me all dark action movies filmed in Zimbabwe"

The described method for estimating motion content and brightness, contrast and color can be used together with the described algorithm for searching the worldwide Internet in order to index and intelligently tag digital multimedia content. The described method allows for powerful searching based on information signals stored inside the content within very large multimedia databases. Once an index of multimedia information exists which includes a motion metric and brightness, contrast and color estimate, users can perform field based sorting of multimedia databases. For example, a user could execute the query "find me all video, from slow moving to fast, by Steven Spielberg", and the database engine would return a list of search results, ordered from slowest to fastest within the requested motion range.

In addition, if the digital video file is associated with a digital audio sequence, then an analysis of the digital audio can occur. An analysis of digital audio could determine if the audio is either music or speech. It can also determine if the speaker is male or female, and other information.
This type of information could then be used to allow a user query such as: "find me all fast video clips which contain loud music"; "find me all action packed movies starring Sylvester Stallone and show me a preview of a portion of the movie where Stallone is talking". This type of powerful searching of content will become increasingly important, as vast quantities of multimedia information become digitized and moved onto digital networks which are accessible to large numbers of consumer and business users. The described method, in its preferred embodiment, is relatively fast to compute. Historically, most systems for analyzing video signals have operated in the frequency domain. Frequency domain processing, although potentially more accurate than image based analysis, has the disadvantage of being compute intensive, making it difficult to scan and index a network for multimedia information in a rapid manner. The described approach of low-cost computation applied to an analysis of motion and brightness, contrast and color has been found to be useful for rapid indexing of large quantities of digital video information when building searchable multimedia databases. Coupled with low-cost computation is the fact that most video files on large distributed networks (such as the Internet) are generally of limited duration. Hence the algorithms described herein can typically be applied to short duration video files in such a way that they can be represented as a single scalar value. This simplifies presentation to the user. In addition to the image space method described here, an algorithm is presented which works on digital video (such as MPEG) which has already been transformed into a frequency domain representation. In this case, the processing can be done solely by analyzing the frequency domain and motion vector data, without needing to perform the computation moving the images into frequency space. 
Degree of Motion Algorithm Details (Image Space)

In order to determine if a given video file contains low, medium or high amounts of motion, it is disclosed to derive a single valued scalar which represents the video data file to a reasonable degree of accuracy. The scalar value, called the motion metric, is an estimate of the type of content found in the video file. The method described here is appropriate for those video files which may be in a variety of different coding formats (such as Vector Quantization, Block Truncation Coding, Intraframe DCT coded), and need to be analyzed in a uniform uncompressed representation. In fact, it is disclosed to decode the video into a uniform representation, since it may be coded in either an intraframe or an interframe coded format. If the video has been coded as intraframe, then the method described here is a scheme for determining the average frame difference for a pixel in a sequence of video. Likewise, for interframe coded sequences, the same metric is determined. This is desirable, even though the interframe coded video has some information about frame to frame differences. The reason that the interframe coded video is uncompressed and then analyzed is that different coding schemes produce different types of interframe patterns which may be non-uniform.

The disclosed invention is based on three discoveries:
time periods can be compressed into buckets which average visual change activity
the averaged rate of change of image activity gives an indication of overall change
an indication of overall change rate is correlated with types of video content

The indication of overall change has been found to be highly correlated with the type of video information stored in a video file.
It has been found through empirical examination that
slow moving video is typically comprised of small frame differences
moderate motion video is typically comprised of medium frame differences
fast moving video is typically comprised of large frame differences
and that
video content such as talking heads and talk shows is comprised of slow moving video
video content such as newscasts and commercials is comprised of moderate speed video
video content such as sports and action films is comprised of fast moving video

The disclosed method operates generally by accessing a multimedia file and evaluating the video data to determine the visual change activity. An algorithm to compute the motion metric operates as follows:

A. Motion Estimator
if the number of samples N exceeds a threshold T, then repeat the Motion Estimator algorithm below for a set of time periods P=N/T. The value Z computed for each period P is then listed in a table of values.
as an optional preprocessing step, employ an adaptive noise reduction algorithm to remove noise. Apply either a flat field (mean) or stray pixel (median) filter to reduce mild and severe noise respectively.
if the video file contains RGB samples, then run the algorithm and average the results into a single scalar value to represent the entire sequence

B. Motion Estimator
determine a fixed sampling grid in time consisting of X video frames
if video samples are compressed, then decompress the samples
decompress all video samples into a uniform decoded representation
adjust RGB for contrast (low/med/high)
compute the RGB frame differences for each frame X with its nearest neighbor
sum up all RGB frame differences for each pixel in each frame X
compute the average RGB frame difference for each pixel for each frame X
sum and then average RGB frame differences for all pixels in all frames in a sequence
the resulting value is the motion metric Z.
The motion metric Z is normalized by taking
Z-NORMAL = Z*(REF-VAL/MAX-DIFFERENCE)
where MAX-DIFFERENCE is the maximum difference for all frames.

map the value Z into one of the following categories:
low degree of motion
moderate degree of motion
high degree of motion
very high degree of motion

Using a typical RGB range of 0-255, the categories for the scalar Z map to:
0-20, motion content, low
20-40, motion content, moderate
40-60, motion content, high
60 and above, motion content, very high

A specific example, using actual values, is as follows:
number of video frames X=1000
sample size is 8 bits per pixel, 24 bits for RGB
average frame difference per frame is 15
the sequence is characterized as low motion

Note that when the number of video frames exceeds the threshold T, then the percentage of each type of motion metric category is displayed. For example, for a video sequence which is one hour long, which may consist of different periods of low, moderate and high motion, the resulting characterization of the video file would appear as follows:
40%, motion content low
10%, motion content moderate
50%, motion content high

Once the degree of motion has been computed, it is stored in the index of a multimedia database. This facilitates user queries and searches based on the degree of motion for a sequence, including the ability to provide field based sorting of video clips based on motion estimates.

Degree of Motion Algorithm Details (Frequency Domain)

The method described above is appropriate for those video files which may be in a variety of different coding formats (such as Vector Quantization, Block Truncation Coding, Intraframe DCT coded), and need to be analyzed in a uniform uncompressed representation. The coded representation is decoded and then an analysis is applied in the image space domain on the uncompressed pixel samples.
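The image space analysis just summarized (difference each decoded frame against its neighbor, average over all pixels and frames, then map the scalar into a category) can be illustrated with a minimal Python sketch. It assumes frames are already decoded into lists of (r, g, b) tuples in the 0-255 range, averages the three channel differences per pixel, and treats the category ranges as half-open intervals; these choices, the function names, and the omission of the noise reduction, contrast adjustment and normalization steps are assumptions of this example, not details confirmed by the text.

```python
def motion_metric(frames):
    """Average absolute RGB difference per pixel between consecutive frames.

    frames: list of decoded frames; each frame is a list of (r, g, b)
    tuples with values in 0-255 (a hypothetical in-memory layout).
    """
    if len(frames) < 2:
        return 0.0
    total = 0.0
    count = 0
    for prev, cur in zip(frames, frames[1:]):
        for (r0, g0, b0), (r1, g1, b1) in zip(prev, cur):
            # Assumption: the three channel differences are averaged.
            total += (abs(r1 - r0) + abs(g1 - g0) + abs(b1 - b0)) / 3.0
            count += 1
    return total / count if count else 0.0


def motion_category(z):
    """Map Z to the categories given for a 0-255 RGB range.

    Assumption: boundaries are half-open (a value of 20 is 'moderate').
    """
    if z < 20:
        return "low"
    if z < 40:
        return "moderate"
    if z < 60:
        return "high"
    return "very high"


static = [[(10, 10, 10)] * 4] * 3
moving = [[(0, 0, 0)] * 4, [(30, 30, 30)] * 4]
print(motion_category(motion_metric(static)))
print(motion_category(motion_metric(moving)))
```

With these inputs the unchanged sequence yields a metric of 0 and the sequence whose pixels all change by 30 yields a metric of 30.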
However, some coding formats (such as MPEG) already exist in the frequency domain and can provide useful information regarding motion, without a need to decode the digital video sequence and perform frame differencing averages. In the case of a coding scheme such as MPEG, the data in its native form already contains estimates of motion implicitly (indeed, the representation itself is called motion estimation). The method described here uses the motion estimation data to derive an estimate of motion for a full sequence of video in a computationally efficient manner.

In order to determine if a given video file contains low, medium or high amounts of motion, it is necessary to derive a single valued scalar which represents the video data file to a reasonable degree of accuracy. The scalar value, called the motion metric, is an estimate of the type of content found in the video file. The idea, when applied to MPEG coded sequences, is based on four key principles:
the MPEG coded data contains both motion vectors and motion vector lengths
the number of non-zero motion vectors is a measure of how many image blocks are moving
the length of motion vectors is a measure of how far image blocks are moving
averaging the number and length of motion vectors per frame indicates degrees of motion

The indication of overall motion has been found to be correlated with the type of video information stored in a video file.
It has been found through empirical examination that
slow moving video is comprised of few motion vectors and small vector lengths
moderate video is comprised of moderate motion vectors and moderate vector lengths
fast moving video is comprised of many motion vectors and large vector lengths
and that
video content such as talking heads and talk shows is comprised of slow moving video
video content such as newscasts and commercials is comprised of moderate speed video
video content such as sports and action films is comprised of fast moving video

An algorithm to compute the motion metric may operate as follows:

Motion Estimator (Frequency Domain)
if the number of frames N exceeds a threshold T, then repeat the Motion Estimator algorithm below for a set of time periods P=N/T. The value Z computed for each period P is then listed in a table of values.

Motion Estimator Algorithm
determine a fixed sampling grid in time consisting of X video frames
determine the total number of non-zero motion vectors for each video frame
determine the average number of non-zero motion vectors per coded block
determine the average length of motion vectors per coded block
sum and average the number of non-zero motion vectors per block in a sequence as A
sum and average the length of non-zero motion vectors per block in a sequence as B
compute a weighted average of the two averaged values as Z=W1*A+W2*B
the resulting value is the motion metric Z
map the value Z into one of the following categories:
low degree of motion
moderate degree of motion
high degree of motion
very high degree of motion

Note that when the number of video frames exceeds the threshold T, then the percentage of each type of motion metric category is displayed.
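A minimal Python sketch of the frequency domain estimator and of the per-period percentage breakdown follows. It assumes the motion vectors have already been extracted from the coded stream as (dx, dy) pairs, one per coded block; the input layout, the function names and the default weights W1 = W2 = 1 are assumptions of this example, since the text gives no values for the weights and no category thresholds for this variant.

```python
import math
from collections import Counter


def frequency_domain_motion_metric(frames, w1=1.0, w2=1.0):
    """Compute Z = W1*A + W2*B from per-block motion vectors.

    frames: list of frames; each frame is a list of (dx, dy) motion
    vectors, one per coded block (hypothetical, pre-extracted input).
    A: average fraction of blocks with a non-zero vector, per frame.
    B: average non-zero vector length per block, per frame.
    """
    counts = []
    lengths = []
    for blocks in frames:
        if not blocks:
            continue
        vector_lengths = [math.hypot(dx, dy) for dx, dy in blocks]
        nonzero = [v for v in vector_lengths if v > 0]
        counts.append(len(nonzero) / len(blocks))
        lengths.append(sum(nonzero) / len(blocks))
    if not counts:
        return 0.0
    a = sum(counts) / len(counts)
    b = sum(lengths) / len(lengths)
    return w1 * a + w2 * b


def category_percentages(labels):
    """Percentage of time periods that fall into each category label."""
    tally = Counter(labels)
    return {label: 100.0 * n / len(labels) for label, n in tally.items()}


frames = [[(0, 0), (3, 4)], [(0, 0), (0, 0)]]
print(frequency_domain_motion_metric(frames))
print(category_percentages(["low"] * 4 + ["moderate"] + ["high"] * 5))
```

The second call shows how per-period category labels, however they are assigned, could be turned into a percentage summary for a long sequence.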
For example, for a video sequence which is one hour long, which may consist of different periods of low, moderate and high motion, the resulting characterization of the video file would appear as follows:
40%, motion content low
10%, motion content moderate
50%, motion content high

Brightness, Contrast and Color Algorithm Details

In order to determine if a given video file contains dark, moderate or bright intensities, it is necessary to derive a single valued scalar which represents the brightness information in the video data file to a reasonable degree of accuracy. The scalar value, called the brightness metric, is an estimate of the brightness of content found in the video file. The idea is based on two key principles:
time periods can be compressed into buckets which average brightness activity
the buckets can be averaged to derive an overall estimate of brightness level

By computing the luminance term for every pixel in a frame, and then for all frames in a sequence, and averaging this value, we end up with an average luminance for a sequence. The same method can be applied to determining a metric for contrast and color, resulting in a scalar value which represents an average contrast and color for a sequence.

Search Results Display

Once the motion and brightness level estimates have been determined, the values are displayed to the user in tabular or graphical form. The tabular format would appear as shown below:
Degree of motion: high
Video intensity: bright

The end result is a simple display of two pieces of textual information. This information is very low bandwidth, and yet encapsulates extensive processing and computation on the data set. As a result, users can more quickly find the multimedia information.

Audio Content

Before reviewing an algorithm used by the disclosed embodiment for analyzing audio files in detail, it is worthwhile to briefly turn to FIG. 3A which provides an overview of the process.
A digital audio file is initially analyzed 301 and an initial determination is made whether the file is speech 307 or music 302. If the file is determined to be music and, in one embodiment, the file is "noisy", a noise reduction filter may be applied and the analysis repeated 303. This is because a noisy speech file may be misinterpreted as music. If the file is music, an analysis may be done to determine if the music is fast or slow 304 and an analysis may be done to determine if the music is bass or treble 305 based on a pitch analysis. In the case of speech, an analysis might be done to determine if the speech 308 is fast or slow based on frequency and whether it is male or female 309 based on pitch. By way of example, knowing that a portion of an audio track for a movie starring Sylvester Stallone has a fast, male voice may be interpreted by retrieval software as indicating that portion of the audio track is an action scene involving Sylvester Stallone. In addition, in certain embodiments, it may be desirable to perform voice recognition analysis to recognize the voice into text 310. In some embodiments, the voice recognition capability may be limited to only recognizing a known voice, while in other more advanced embodiments, omni-voice recognition capability may be added. In either event, the recognized text may be added to the stored information for the media file and be used for searching and retrieval.

Computation of a Music-speech Metric

In order to determine if a given audio file contains music, speech, or a combination of both types of audio, it is disclosed in one embodiment to derive a single valued scalar which represents the audio data file to a reasonable degree of accuracy. The scalar value, called the music-speech metric, is an estimate of the type of content found in the audio file.
The idea is based on three key principles: time periods can be compressed into buckets which average amplitude activity the averaged rate of change of amplitude activity gives an indication of overall change an indication of overall amplitude change rate is correlated with types of audio content The indication of overall change has been found to be highly correlated with the type of audio information stored in an audio file. It has been found through empirical examination that music is typically comprised of a continuous amplitude signal speech is typically comprised of a discontinuous amplitude signal sound effects are typically comprised of a discontinuous amplitude signal and that, music signals are typically found to have low rates of change in amplitude activity speech signals are typically found to have high rates of change in amplitude activity sound effects are typically found to have high rates of change in amplitude activity audio comprised of music and speech has moderate rates of change in amplitude activity Continuous signals are characterized by low rates of change. Various types of music, including rock, classical and jazz are often relatively continuous in nature with respect to the amplitude signal. Rarely does music jump from a minimum to a maximum amplitude. This is illustrated by FIG. 3C which illustrates a typical amplitude signal 330 for music. Similarly, it is rare that speech results in a continuous amplitude signal with only small changes in amplitude. Discontinuous signals are characterized by high rates of change. For speech, there are often bursty periods of large amplitude interspersed with extended periods of silence of low amplitude. This is illustrated by FIG. 3B which illustrates a typical amplitude signal 320 for speech. Sometimes speech will be interspersed with music, for example if there is talk over a song. This is illustrated by FIG. 
3D which illustrates signal 340 having period 341 which would be interpreted as music, period 342 which would be speech, period 343 music, period 344 speech, period 345 music and period 346 speech. For sound effects, there are often bursty periods of large amplitude interspersed with bursty periods of low amplitude.

Turning now to FIG. 3E, if the audio file is a compressed file (which may be in any of a number of known compression formats), it is first decompressed using any known decompression algorithm, block 351. An amplitude analysis is then performed on the audio track to provide a music-speech metric value. The amplitude analysis is performed as follows:

The audio track is divided into time segments of a predetermined length, block 352. In the described embodiment, each time segment is 50 ms. However, in alternate embodiments, the time segments may be of a greater or lesser length.

For each segment, a normalized amplitude deviation is computed, block 356. This is described in greater detail with reference to FIG. 3F. First, for each time segment, the maximum amplitude and minimum amplitude are determined, block 351. In the example of FIG. 3B, values range from 0 to 256 (in an alternative embodiment, the values may be based on floating point calculations and may range from 0 to 1.0). For the first interval 321, the maximum amplitude value is shown as 160, for the second interval 322, it is 158 and for the third interval 323, it is 156. Then, the average maximum amplitude and average minimum amplitude are computed for all time intervals, block 352. Again, using the example in FIG. 3B, the average maximum amplitude will be 158. Next, a value MAX-DEV is computed for each interval as the absolute value of maximum amplitude for the interval minus the average maximum, block 353. For the first interval of FIG. 3B, the MAX-DEV will be 2, for the second interval, it will be 0 and for the third interval, it will be 2.
Finally, the MAX-DEV is normalized by computing MAX-DEV*(REF-VALUE/MAX) where the reference value is 256 in the described embodiment (and may be 1.0 in a floating point embodiment) and MAX is the maximum amplitude for all of the intervals. Thus, for the first interval, where MAX-DEV is 2 and MAX is 160, the normalized value for MAX-DEV will be 2*(256/160)=3.2. Normalizing the deviation value removes dependencies based on volume differences in the audio files and allows for comparison of files recorded at different volumes.

Finally, the normalized MAX-DEV values for each segment are averaged together, block 357, to determine a music-speech metric. High values tend to indicate speech, low values tend to indicate music and medium values tend to indicate a combination, block 358.

It should be noted that, for efficiency, only a portion of the audio file may be analyzed. For example, N seconds of the audio file may be randomly chosen for analysis. Also, if the audio file contains stereo or quadraphonic samples, then the algorithm described above may be run on each channel, and the results averaged into a single scalar value to represent the entire sequence.

Note also that when the number of samples exceeds the threshold T, then the percentage of each type of music-speech metric category may be computed and displayed.
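The amplitude analysis of FIGS. 3E and 3F can be illustrated with a minimal Python sketch. It assumes decompressed, non-negative amplitude samples on the 0 to 256 scale of FIG. 3B and follows the MAX-DEV*(REF-VALUE/MAX) normalization as stated; the function name and parameters are hypothetical, and only the per-segment maxima are used because the text defines the metric in terms of MAX-DEV alone. No thresholds for "high", "medium" or "low" are given in the text, so none are assumed here.

```python
def music_speech_metric(samples, sample_rate, segment_ms=50, ref_value=256):
    """Average normalized deviation of per-segment maximum amplitude.

    samples: non-negative amplitude values (assumed 0-256 scale).
    sample_rate: samples per second.
    segment_ms: segment length; 50 ms in the described embodiment.
    """
    if not samples:
        return 0.0
    seg_len = max(1, sample_rate * segment_ms // 1000)
    # Maximum amplitude of each time segment.
    maxima = [
        max(samples[i:i + seg_len])
        for i in range(0, len(samples), seg_len)
    ]
    overall_max = max(maxima)
    if overall_max == 0:
        return 0.0
    avg_max = sum(maxima) / len(maxima)
    # MAX-DEV per segment, normalized by REF-VALUE / MAX.
    normalized = [
        abs(m - avg_max) * (ref_value / overall_max) for m in maxima
    ]
    return sum(normalized) / len(normalized)


# Three 50 ms segments with maxima 160, 158 and 156, as in FIG. 3B.
samples = [160] * 50 + [158] * 50 + [156] * 50
print(music_speech_metric(samples, sample_rate=1000))
```

For the three intervals of FIG. 3B the normalized deviations are 3.2, 0 and 3.2, so the metric is their mean.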
For example, for a soundtrack which is one hour long, which may consist of different periods of silence, music, speech and sound effects, the resulting characterization of the audio file would appear as follows:
40%, music content: high, speech content: low
10%, music content: high, speech content: medium
10%, music content: medium, speech content: medium
10%, music content: medium, speech content: high
30%, music content: low, speech content: high

Volume Algorithm Details

In order to determine if a given audio file contains quiet, soft or loud audio information, it is disclosed to derive a single valued scalar which represents the volume information in the audio data file to a reasonable degree of accuracy. The scalar value, called the volume level metric, is an estimate of the volume of content found in the audio file. The idea is based on two key principles:
time periods can be compressed into buckets which average volume activity
the buckets can be averaged to derive an overall estimate of volume level

In general, the disclosed algorithm provides for determining the volume level of data in an audio file by evaluating the average amplitude for a set of sampled signals. In particular, the disclosed algorithm comprises the steps of:
if the number of samples N exceeds a threshold T, then repeat the Volume Audio Channel Estimator algorithm, below, for a set of time periods P=N/T. The value Z computed for each period P is then listed in a table of values.
if the audio file contains mono samples, then run the algorithm on a single channel
if the audio file contains stereo samples, then run the algorithm on each channel, and average the results into a single scalar value to represent the entire sequence
if the audio file contains quadraphonic samples, then run the algorithm on each channel, and average the results into a single scalar value to represent the entire sequence

The algorithm used by the described embodiment for volume estimation is then given by FIG.
3G as follows:
if audio samples are compressed, then decompress the samples into a uniform PCM coded representation, block 361.
The audio track is mapped into X time segment buckets, 362.
determine the total number of audio samples N, block 366.
The samples will get mapped into time segment buckets, block 367. The mapping is such that a single bucket represents N/X samples of sound, and the N/X samples is called a compressed time sample C.
Compute the average amplitude value for each bucket X, 368, by summing up all amplitude values within C and dividing to obtain an average amplitude.
compute the average amplitude A for all X buckets, block 369
the resulting value is volume estimate A
map the value A into one of five categories:
quiet
soft
moderate
loud
very loud

Using a typical maximum amplitude excursion of 100, the categories for A map to:
0-50, quiet
50-70, soft
70-80, moderate
80-100, loud
100-above, very loud

It will be apparent to one skilled in the art that alternate "bucket sizes" can be used and the mapping may be varied from the mapping presented in the disclosed algorithm without departure from the spirit and scope of the invention.

When the number of samples exceeds the threshold T, then the percentage of each type of volume category is displayed. For example, for a soundtrack which is one hour long, which may consist of different periods of silence, loudness, softness and moderate sound levels, the resulting characterization of the audio file would appear as follows:
30%, quiet
20%, soft
5%, moderate
10%, loud
35%, very loud

Search Results Display

Once the music-speech and volume level estimates have been determined, the values are displayed to the user in tabular or graphical form. The tabular format may appear as shown below:
Music content: high
Speech content: low
Volume level: loud

The end result is a simple display of three pieces of textual information.
This information is very low bandwidth, and yet encapsulates extensive processing and computation on the data set. As a result, users can more quickly find the multimedia information they are looking for.

Waveform Display

A focus of the method described herein is to generate a visual display of audio information which can aid a user to determine if an audio file contains the audio content they are looking for. This method complements the other types of useful information which can be computed and/or extracted from digital audio files; the combination of context and content analysis, together with graphical display of content data, results in a composite useful snapshot of a piece of digital media information.

As users need to sift through large quantities of music, sound effects and speeches (on large distributed networks such as the Internet) it will be useful to process the audio signals to enhance the ability to distinguish one audio file from another. The use of only keyword based searching for media content will prove to be increasingly less useful than a true analysis and display of the media signal.

The algorithm described herein is used to display a time compressed representation of an audio signal. The method is focused on visually providing some high level features of the time varying sound signal. The method described can allow users to:
differentiate visually between music and speech
observe periods of silence interspersed with loud or soft music/speech
observe significant changes in volume level
identify extended periods in an audio track where volume level is very low or high

Using a multimedia search engine it is possible for users to execute a query such as:
"find me all soft music by Beethoven from the seventeenth century"
The results returned might be a set of fifty musical pieces by Beethoven.
If the searcher knows that the piece of music they are looking for has a very quiet part towards the end of the piece, the user could view the graphical representation and potentially find the quiet part by seeing the waveform display illustrate a volume decrease towards the end of the waveform image. This could save the searcher great amounts of time that would have been required to listen to all fifty pieces of music. Using a multimedia search engine it is possible for users to execute a query such as: "find me all loud speeches by Martin Luther King" A searcher might be looking for a speech by Martin Luther King, where the speech starts out with him yelling loudly, and then speaking in a normal tone of voice. If twenty speeches are returned from the search engine results, then the searcher could visually scan the results and look for a waveform display which shows high volume at the beginning and then levels off within the first portion of the audio signal. This type of visual identification could save the searcher great amounts of time which would be required to listen to all twenty speeches. Continuous signals are characterized by low rates of change. Various types of music, including rock, classical and jazz are often relatively continuous in nature with respect to the amplitude signal. Rarely does music jump from a minimum to a maximum amplitude. Similarly, it is rare that speech results in a continuous amplitude signal with only small changes in amplitude. Discontinuous signals are characterized by high rates of change. For speech, there are often bursty periods of large amplitude interspersed with extended periods of silence of low amplitude. For sound effects, there are often bursty periods of large amplitude interspersed with bursty periods of low amplitude. These trends can often be identified computationally, or visually, or using both methods. 
A method is illustrated here which derives a visual representation of sound in a temporally compressed format. The goal is to illustrate long term trends in the audio signal which will be useful to a user when searching digital multimedia content. Note that the method produces visual images of constant horizontal resolution, independent of the duration in seconds. This means that temporal compression must occur to varying degrees while still maintaining a useful representation of long term amplitude trends within a limited area of screen display. An algorithm, as used by the described embodiment, to compute and display the waveform operates as follows:

A. Waveform Display
  if the number of samples N exceeds a threshold T, then repeat the Waveform Display Algorithm below for a set of time periods P=N/T. A different waveform is computed for each time period.
  if the audio file contains mono samples, then run the algorithm on a single channel
  if the audio file contains stereo samples, then run the algorithm on each channel, and display the results for each channel
  if the audio file contains quadraphonic samples, then run the algorithm on each channel, and display the results for each channel

B. Waveform Display Algorithm
  determine a fixed sampling grid in time consisting of X buckets
  if audio samples are compressed, then decompress the samples
    decompress all audio samples into a uniform PCM coded representation
  determine the total number of audio samples N
  determine the number of samples which get mapped into a single bucket
    the mapping is that a single bucket represents N/X samples of sound
    the N/X samples term is called a compressed time sample C
  compute the minimum, maximum and average amplitude value for each bucket X
  display an RGB interpolated line from the minimum to the maximum amplitude
    the line passes through the average amplitude
    red represents maximum amplitude
    green represents average amplitude
    blue represents minimum amplitude
    the interpolation occurs using integer arithmetic
    the line is rendered vertically from top to bottom within each bucket X
  compress the resulting waveform using a DCT based compression scheme (or alternate)

Note that when the number of samples exceeds the threshold T, then a series of waveform displays are computed. For example, for a soundtrack which is one hour long, which may consist of different periods of silence, music, speech and sound effects, the resulting waveform display characterization would need to be broken up into segments and displayed separately. The ability to scan through these displays would then be under user control.

Additional Processing and Analysis of Audio Files

After a digital audio file has been classified as music, speech or a combination of the two, additional processing and analysis can be applied in order to extract more useful information from the data. This more useful information can be used to enhance the ability of users to search for digital audio files.
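Returning to the Waveform Display Algorithm of part B above, the bucket statistics and the color interpolation might be sketched as follows. The function names are hypothetical, the samples are assumed to be already decompressed to PCM integers, and the exact blue-to-green-to-red ramp is one possible reading of the RGB interpolation step; rendering and DCT compression are left out.

```python
from typing import List, Sequence, Tuple


def waveform_buckets(samples: Sequence[int], x_buckets: int) -> List[Tuple[int, int, int]]:
    """Map N PCM samples onto X buckets (about N/X samples each) and return
    (minimum, average, maximum) per bucket, using integer arithmetic."""
    n = len(samples)
    if n == 0 or x_buckets <= 0:
        return []
    x_buckets = min(x_buckets, n)  # never create an empty bucket
    result = []
    for b in range(x_buckets):
        chunk = samples[b * n // x_buckets:(b + 1) * n // x_buckets]
        result.append((min(chunk), sum(chunk) // len(chunk), max(chunk)))
    return result


def line_color(value: int, lo: int, avg: int, hi: int) -> Tuple[int, int, int]:
    """RGB color for one point of the vertical line in a bucket, assuming
    lo <= value <= hi: blue at the minimum, green at the average and red
    at the maximum, interpolated with integers only."""
    if value >= avg:
        span = hi - avg
        t = 0 if span == 0 else (value - avg) * 255 // span
        return (t, 255 - t, 0)
    t = (avg - value) * 255 // (avg - lo)
    return (0, 255 - t, t)
```

For a file whose sample count exceeds the threshold T, the same function could simply be applied to each time period in turn, yielding one waveform per period.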
For the case of audio files which have been classified as music, with some degree of speech content (or which have been classified as speech, with some degree of music content) one can assume that there is a speaking or singing voice within the audio file accompanied with the music. A conventional speech recognition algorithm can then be applied (also called speech to text) which can convert the speech information in the audio file into textual information. This will allow the audio file to then be searchable based on its internal characteristics, as well as the actual lyrics or speech narrative which accompanies the music.

For the case of audio files which have been classified as speech, one can assume that there is a reasonable certainty of a speaking voice within the audio file. A conventional speech recognition algorithm can then be applied (also called speech to text) which can convert the speech information in the audio file into textual information. This will allow the audio file to then be searchable based on its internal characteristics, as well as the actual narrative which is within the audio file. The speech may correspond to closed captioning information, script dialogue or other forms of textual representation.

Determining if a Given Music File Contains Fast or Slow Music

When an audio file is first examined, a determination can be made if the audio data is sampled and digitized, or is completely synthetic. If the data has been digitized, then all of the processes described above can be applied. If the data has been synthesized, then the audio file is MIDI information (Musical Instrument Digital Interface). If a file has been identified as MIDI, then it is possible to scan for information in the file regarding tempo, tempo changes and key signature. That information can be analyzed to determine the average tempo, as well as the rate of change of the tempo. In addition, the key signature of the music can be extracted.
The tempo, rate of change of tempo and key signature can all be displayed in search results for a user as:

  tempo: (slow, moderate, fast)
  rate of change of tempo: (low, medium, high)
    indicates if the music changes pace frequently
  key signature
    key of music
    indication of minor and major key

Note that when the number of samples exceeds the threshold T, then the percentage of each type of tempo category is displayed. For example, for a soundtrack which is one hour long, which may consist of different periods of fast, moderate or slow tempo levels, the resulting characterization of the music file would appear as follows:

  30%, slow
  20%, moderate
  20%, fast
  30%, very fast

Previews

The described embodiment is concerned with parsing content files and building low-bandwidth previews of higher bandwidth data files. This allows rapid previewing of media data files, without need to download the entire file.

Preview Overview

In the described embodiment, for video media files, a preview mechanism has been developed. A sample of the results of a search, showing a media preview, is given in FIG. 4A. The preview is explained in greater detail with reference to FIG. 4B.

FIG. 4B illustrates a preview 410. The preview comprises a first sprocket area 411 at the top of the preview, a second sprocket area at the bottom of the preview, and an image area having three images of height IH 412 and width IW 413. The preview itself is of height FH 414 and width FW 415. In addition, in certain embodiments, the preview may include a copyright area 416 for providing copyright information relating to the preview, and certain embodiments may contain an area, for example in the upper left hand corner of the first sprocket area 411, for a corporate logo or other branding information.

A general algorithm for generation and display of previews is disclosed with reference to FIG. 4C.
Generally, after finding a media object, as was discussed above in Section 1 in connection with crawling to locate media files, the media file is examined to locate portions having predetermined characteristics. For example, portions of a video file having fast action may be located, or portions of a video which are in black and white. Next, a preview of the object is generated and stored. This will be discussed in greater detail in connection with FIG. 4D. Finally, when requested by a user, for example, in response to a search, the preview may be displayed.

Preview Generation

Turning now to FIG. 4D, the process for generation of a preview is discussed in greater detail. Initially, a determination is made of the object type, block 431. The object may be, for example, a digital video file, an animation file, or a panoramic image. In the case of digital video, as was discussed above, the file may be downloadable or streaming. And, if downloadable, the file may have table based frame descriptions or track based frame descriptions. Animation objects include animated series of frames using a lossless differential encoding scheme and hyperlinked animation. Regardless of the media type, a preview is generated generally along the lines of the preview of FIGS. 4A and 4B, block 432.

Sizing of Preview and Images

The sizing of the preview and of images is done in the described embodiment as follows:

A) Initially, an aspect ratio is computed for the preview. The aspect ratio is computed as the width of a frame of the object divided by the height, or A=W/H.

B) The target filmstrip is set with a width FW 415 and a height FH. A distance ID is set for the distance between images on the filmstrip. Next, a sprocket height and width are set, resulting in a sprocket region height (SRH 411). The particular heights and widths may vary from implementation to implementation dependent on a variety of factors such as expected display resolution. In alternative embodiments, differing sprocket designs may be utilized and background colors, etc. may be selected. In certain embodiments, it may be desirable to include copyright information 416 or a logo.

C) In any event, the target height IH 412 of a filmstrip image can be computed as IH=FH-(2*SRH). The target width of an image can be computed as a function of the aspect ratio as follows: IW=A*IH. The number N of filmstrip images which will be displayed can then be computed as N=FW/(IW+ID).

Using the above calculations, the number, width and height of images can be determined for display of a preview.

Selection of Images

The selection of images for use in the preview is dependent on whether the preview is being generated for a 3D media object, a digital video or animation object, or a panoramic object.

Selection of Images--Digital Video and Animation

For digital video or animation sequences, a temporal width TW is calculated, block 442, as TW=T/(N+1) where T is equal to the length (time) of the media object and N is the number of frames calculated as discussed above. N frames from the media object are then decompressed to pure RGB at N fixed points, where the N fixed points are at TW, 2*TW, 3*TW, . . . N*TW time into the media object. This process reduces the need to decompress the entire video file. Scanning to the particular ones of the N frames is accomplished by using the table based frame description, the track based frame description or by streaming, dependent on the media source file. An objective of choosing N frames spaced TW apart is to develop a preview with frames from various portions of the media file so that the user will be given an opportunity to review the various portions in making a determination if the user wishes to access the entire file. The decompress process may utilize intraframe, predictive decoding or bilinear decoding dependent on the source file. In the described embodiment, a color space conversion is then performed from RGB to YUV.
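The sizing formulas of steps A) to C) and the temporal width TW can be combined into a short worked sketch. The function and field names are hypothetical, and rounding IW to the nearest pixel and truncating N to a whole number of images are assumptions, since the text does not say how fractional results are handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FilmstripLayout:
    image_height: int  # IH
    image_width: int   # IW
    image_count: int   # N


def filmstrip_layout(frame_w: int, frame_h: int, fw: int, fh: int,
                     srh: int, image_gap: int) -> FilmstripLayout:
    """Apply A=W/H, IH=FH-(2*SRH), IW=A*IH and N=FW/(IW+ID)."""
    aspect = frame_w / frame_h
    ih = fh - 2 * srh
    iw = round(aspect * ih)          # assumed rounding to whole pixels
    n = fw // (iw + image_gap)       # assumed truncation to whole images
    return FilmstripLayout(ih, iw, n)


def sample_times(duration: float, n: int) -> List[float]:
    """Times TW, 2*TW, ... N*TW with TW = T/(N+1)."""
    tw = duration / (n + 1)
    return [k * tw for k in range(1, n + 1)]
```

For example, with a 320 by 240 source frame, FW=400, FH=100, SRH=5 and ID=4, this gives IH=90, IW=120 and N=3; for a 60 second clip the three frames would then be taken at 15, 30 and 45 seconds.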
Optionally, an adaptive noise reduction process may be performed.

Each of the N frames is then analyzed to determine if the frame meets predetermined criteria for display, block 444. Again, an objective is to provide the user with a quality preview allowing a decision if the entire file should be accessed. In the described embodiment, each of the N frames is analyzed for brightness, contrast and quality. If a frame meets the criteria, block 445, then the frame is scaled, block 447, from its original width W and height H to width IW and height IH using interpolation. Linear interpolation is utilized and the aspect ratio is maintained.

Each frame is also analyzed for a set of attributes, block 448. The attributes in the described embodiment include brightness, contrast (luminance, deviation), chrominance, and dominant color. Brightness indicates the overall brightness of the digital video clip. Color indicates if the video clip is in full color or black and white, and contrast indicates the degree of contrast in the movie. These high level content attributes tend to be more meaningful for the typically short video sequences which are published on the Internet and Intranet. The computation for each of the attributes is detailed below. This information can then be used for enhanced searching. For example, chrominance can be used for searching for black and white versus color video. In addition, embodiments may provide for optionally storing a feature vector for texture, composition and structure. These attributes can be averaged across the N frames and the average for each attribute is stored as a searchable metric. In addition, optionally, the contrast of the frames may be enhanced using a contrast enhancement algorithm.

We will now briefly describe computation of the chrominance, luminance and contrast values.

The maximum chrominance is computed for the selected N frames in the video sequence. The maximum chrominance for the set of frames is determined by finding the maximum chrominance of each frame, which is in turn the maximum chrominance over all pixels in that frame. This maximum chrominance value for the set of selected frames is then compared against a threshold. If the maximum chrominance for the sequence is larger than the threshold, then the sequence is considered in full color. If the maximum chrominance for the sequence is smaller than the threshold, then the sequence is considered in black and white.

The luminance is computed for the selected N frames in the video sequence. The luminance is then averaged into a single scalar value.

To determine contrast, luminance values are computed for each frame of the digital video sequence. The luminance values which fall below the fifth percentile, and above the ninety-fifth percentile, are then removed from the set of values. This is done to remove random noise. The remaining luminance values are then examined for the maximum and minimum luminance. The difference between the maximum and minimum luminance is computed as the contrast for a single frame. The contrast value is then computed for all frames in the sequence, and the average contrast is stored as the resulting value.

Finally, audio and video clips may be associated with each frame, block 449. For audio, a standard audio segment may be selected or alternatively an audio selection algorithm may be applied which finds audio which meets predetermined criteria, such as a preset volume level. For video, a video track of duration VD is selected. The video selection may be a standard video segment or the video segment may be selected using a video selection algorithm which selects video segments meeting predetermined criteria such as video at a predetermined brightness, contrast or motion.

Going back to analysis of the frames, if one of the N frames does not meet the criteria, block 445, a frame iterator algorithm is applied to select a new frame.
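The chrominance, luminance and contrast computations described above might be sketched as follows. Each frame is assumed to be supplied as a flat list of per-pixel luminance (or chrominance) values that have already been derived from the YUV data; the function names are hypothetical, and trimming by rank position is only one possible reading of the fifth and ninety-fifth percentile rule.

```python
from typing import Sequence


def frame_contrast(luma: Sequence[int]) -> int:
    """Contrast of one frame: drop values below the 5th and above the 95th
    percentile (by rank, an assumed definition), then take max minus min."""
    if not luma:
        raise ValueError("frame has no luminance values")
    ordered = sorted(luma)
    cut = len(ordered) * 5 // 100
    kept = ordered[cut:len(ordered) - cut]
    return kept[-1] - kept[0]


def sequence_contrast(frames: Sequence[Sequence[int]]) -> float:
    """Average of the per-frame contrast values."""
    return sum(frame_contrast(f) for f in frames) / len(frames)


def sequence_brightness(frames: Sequence[Sequence[int]]) -> float:
    """Luminance averaged into a single scalar across the selected frames."""
    return sum(sum(f) / len(f) for f in frames) / len(frames)


def is_full_color(chroma_frames: Sequence[Sequence[int]], threshold: int) -> bool:
    """Full color if the maximum chrominance over all pixels of all
    selected frames exceeds the threshold; otherwise black and white."""
    return max(max(f) for f in chroma_frames) > threshold
```

The threshold passed to `is_full_color` is left to the caller because the text does not give a value for it.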
The frame iterator algorithm of the described embodiment selects another frame by iteratively selecting frames between the frame in question and the other frames until a frame is found which meets the criteria or until a predetermined number of iterations have been applied. If the predetermined number of iterations is applied without successfully finding a frame which meets the criteria, the originally selected frame is used. The algorithm starts with the original frame at TW (or, 2*TW, 3*TW . . . N*TW) and selects, first, a frame at TW-(TW/2) (i.e., a frame halfway between the original frame and the beginning). If this frame does not meet the criteria, a frame at TW+(TW/2) is selected and iteratively frames are selected according to the pattern: (TW-(TW/2)), (TW+(TW/2)), (TW-(TW/4)), (TW+(TW/4)), . . . (TW-(TW/X)), (TW+(TW/X)).

Selection of Images--Panoramic

Interactive panoramic images are often stored as multimedia files. Typically these media files are stored as a series of compressed video images, with a total file size ranging from 100 to 500 Kbytes. The described embodiment provides a method which creates a low bandwidth preview of a panoramic picture. A preview, in the described embodiment, utilizes approximately 10 Kbytes in storage size, which is only 1/10th to 1/50th of the original panoramic storage. The preview provides a high-quality, low bit rate display of the full panoramic scene. In the described embodiment, the method for creating the panoramic preview, block 450, may be described as follows:

1) Extract all information from the header of the media file to determine the width, height and number of tiles for the panoramic scene. Create an offscreen buffer to generate the new panoramic picture preview.

2) For each tiled image on the media file, decode the image using the coding algorithm which was used to encode the original tiles. The decoded files are converted to pure RGB and then to YUV. The tiles are scaled from (W, H) to (IW, IH) in a manner similar to that discussed above. In other embodiments, as a next step, the image may be scaled by a factor of two in each direction.

3) Re-orient the tile by rotating it 90 degrees clockwise.

4) For each scaled and rotated tile, copy the image (scanning from right to left) into the offscreen buffer.

5) In the case of an embodiment which scales by the factor of two, when all tiles have been processed, examine the resulting picture size after it has been reduced by a factor of two. If the image is below a fixed resolution then the process is complete. If the image is above a fixed resolution, then reduce the picture size again by a factor of two, until it is less than or equal to the fixed resolution.

6) Composite the reconstructed panoramic picture with filmstrip images on the top and bottom of the picture to create a look and feel consistent with the filmstrip images for video sequences.

7) Any of a number of known compression algorithms may be applied to the reconstructed and composited panoramic picture to produce a low bandwidth image preview. Coding schemes can include progressive or interlaced transmission algorithms.

Selection of Images--3D

For 3D images, a top view, bottom view, front view and rear view are selected for images to display, block 441.

Interactive Display of Search Results

When returning search results from a user's multimedia query to a database, appropriate commands are generated to drive a web browser display to facilitate interactive viewing of the search results. Depending on the position a user selects (for example with a mouse or other cursor control device) within a preview of the media content shown in the search result, the user will begin interaction with the content at different points in time or space. The end result is a more useful and interactive experience for a user employing a multimedia search engine.
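Returning to the frame iterator described for digital video and animation, before the panoramic steps, the search pattern can be sketched as below. The names are hypothetical, and two points are assumptions: that the divisor X doubles at each step (2, 4, 8, ...), which matches the first terms of the pattern but is not stated explicitly, and that one "iteration" covers one minus/plus pair of candidates.

```python
from typing import Callable, Iterator


def candidate_times(original: float, tw: float, max_iterations: int) -> Iterator[float]:
    """Yield original-TW/2, original+TW/2, original-TW/4, original+TW/4, ...
    The doubling divisor is an assumed reading of the TW/X pattern."""
    divisor = 2
    for _ in range(max_iterations):
        offset = tw / divisor
        yield original - offset
        yield original + offset
        divisor *= 2


def pick_frame_time(original: float, tw: float,
                    meets_criteria: Callable[[float], bool],
                    max_iterations: int = 4) -> float:
    """Return the first candidate time whose frame meets the criteria,
    falling back to the originally selected frame if none does."""
    for t in candidate_times(original, tw, max_iterations):
        if meets_criteria(t):
            return t
    return original
```

Here `meets_criteria` stands in for the brightness, contrast and quality check of block 445; it would decode the frame at the given time and test it.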
For example, if a user searches for videos of a car, then the web server application can return a series of HTML and EMBED tags which set up a movie controller, allowing a user to interact with the videos of cars. When the low bandwidth preview (a filmstrip showing select scenes of the video clip) is presented to a user, the position of the mouse when the user clicks within the preview will drive the EMBED tags which are created and then returned from the server. For example:

  if a user clicks down in frame X of a filmstrip, then an in-line viewer is created which will begin display and playback of the movie at frame X. In an alternative embodiment, a snippet or short segment of a video or audio file may be stored with the preview and associated with a particular portion of the preview. This method avoids the need to access the original file for playback of a short audio or video segment.
  if a user clicks down at pan angle X, tilt angle Y and fov Z within a panorama filmstrip, then an in-line viewer is created which will begin display of the panorama at those precise viewing parameters.
  if a user clicks down within a select viewpoint of a 3D scene within a filmstrip, then an in-line viewer is created which will begin display of the 3D scene at that viewpoint.
  if a user clicks down within an audio waveform at time T, then an in-line viewer is created which will begin playback of the sound at that particular time T.

By allowing users to drive the points in time or space where their display of interactive media begins, users can more precisely hone in on the content they are looking for. For example, if a user is looking for a piece of music which has a certain selection which is very loud, they may observe the volume increase in the graphical waveform display, click on that portion of the waveform and then hear the loud portion of the music. This takes them directly to the selection of interest.
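The mapping from a click position to a starting point for playback might look like the following sketch. It assumes a filmstrip laid out as N images of width IW, each followed by a gap ID, starting at the left edge, with frame k taken at k*TW as in the preview generation described earlier; the function names and this layout are illustrative assumptions, and generation of the HTML and EMBED tags themselves is not shown.

```python
from typing import Optional


def filmstrip_click_to_time(x: int, iw: int, image_gap: int,
                            n: int, duration: float) -> Optional[float]:
    """Return the start time for the filmstrip frame under horizontal
    pixel x, or None if the click fell in a gap or outside the images."""
    slot = iw + image_gap
    if x < 0 or x // slot >= n or x % slot >= iw:
        return None
    tw = duration / (n + 1)
    return (x // slot + 1) * tw


def waveform_click_to_time(x: int, width: int, duration: float) -> float:
    """Return the playback time T for a click at pixel x of a waveform
    image that spans the whole duration across `width` pixels."""
    x = min(max(x, 0), width - 1)
    return duration * x / width
```

With three 120 pixel images separated by 4 pixel gaps and a 60 second clip, a click at x=130 falls in the second image and maps to 30 seconds.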
Use of Media Icons to Illustrate Search Results

When returning search results from a user's multimedia query to a database, the described embodiment provides for both a textual and a visual method for showing that the search results are of different media types. For example, when executing a search for the word "karate" it is possible that numerous search results will be returned, including digital video, audio, 3D, animation, etc. Video may show karate methods, sound might be an interview with a karate expert, 3D could be a simulation of a karate chop and animation a display of a flipbook of a karate flip. In order to enable a viewer to rapidly scan a page and distinguish the different media types, an icon which is representative of each type of media is employed.

Using a universal set of icons for media types, as shown in the figures, enhances the ability of users to scan a page of search results and quickly jump to those responses which are most relevant. In addition, the use of media icons can transcend barriers of language and culture, making it easier for people from different cultures and speaking different languages to understand search results for multimedia queries.

Selection of Basic, Detailed or Visual Results

In the described embodiment, users can select basic, detailed or visual search results. If a user selects visual search results, then only visual images, filmstrips or waveforms are presented to users as search results. The visual search results are typically displayed as a set of mosaics on a page, usually multiple thumbnail images per row, and multiple filmstrips (usually two) per row. Clicking on images, waveforms or filmstrips then takes users to new web pages where more information is described about the media content. This allows users to rapidly scan a page of visual search results to see if they can find what they are looking for.
Timecode Based Display

Text keywords may be found within certain multimedia files (e.g., the content of the file). For example, movie and other video files sometimes contain a movie text track, a closed caption track or a musical soundtrack lyrics track. For each text keyword which is found in one of these tracks, a new database is created by the process of the present invention. This database maps keywords to [text, timecode] pairs. This is done so that it is possible to map keywords directly to the media file and timecode position where the media file text reference occurs. The timecode position is subsequently used when producing search results to viewers, so that viewers can jump directly to that portion of the media sequence where the matching text occurs.

Alternatives to the Preferred Embodiment of the Present Invention

There are, of course, alternatives to the described embodiment which are within the reach of one of ordinary skill in the relevant art. The present invention is intended to be limited only by the claims presented below.

Appendix A

MediaX File Format

The mediaX file provides a hierarchy of information to describe multimedia content files to enhance the ability to search, preview and sample multimedia content. The categories of information which are stored in the mediaX file are as follows:

1. Media Types
2. Media Makers and Creators
3. Media Copyrights and Permission
4. Media Details
5. Media Description Information
6. Media Content Attributes
7. Media Location
8. Media Previews
9. Media Commerce
10. Media Editing and Creation
11. Media Playback
12. Media Size, Time and Speed
13. Media Contact Information

1. Media Types
  multimedia data type
    i.e., sound, video, MIDI, panorama image, animation
  multimedia file format type
    i.e., mov (QuickTime), ra (RealAudio), mpg (MPEG)

2. Media Makers and Creators
  title of multimedia file (i.e., "Beethoven's Fifth", "The Simpsons")
  author of content information
  director
  producer
  performers
  writers
  screenplay
  list of characters in cast
  studio of production
  biographies of characters
  narrator
  composer
  cinematographer
  costume designers
  editors
  mixer
  additional credits
  list of physical locations of creation of content
  list of physical locations of sound recording and mix

3. Media Copyrights and Permission
  copyright type holder information
    type of copyright
      copyrighted media
      copyleft media, e.g. freely distributed data with indicator of source present
      unknown copyright
    copyright holder information
      i.e., "Disney"
  indication if multimedia data can be played at a third party search site
    a "permission to show" bit is used in the database
    if content can be shown remotely, specify format of listing:
      text-only
      image previews
      filmstrips
      motion previews
      in-line viewing of full content
      note that each representation includes all previous representations
  watermark
    field of arbitrary length to describe watermark usage in content

4. Media Details
  number of tracks in the multimedia file
  type and ID of each track in the multimedia file
  indication of multimedia file format details
  indication if content is digitized, synthesized, etc.
  credits for those providing source content
  language of the multimedia sound data (if available)
    i.e., Chinese, Spanish, etc.
  video parameters:
    frame rate
    specific compression algorithm which is used
  sound parameters:
    number of channels
    sampling rate
    sampling bit depth
    types of sound channels
      stereo, mono, etc.
    specific compression algorithm which is used
  indicate if material can be streamed
  indicate if material is seekable
  aspect ratio
  interlaced vs. progressive scan
  black and white or color

5. Media Description Information
  abstract describing the content in detail
  summary
    typically at least one full sentence and preferably a full paragraph of information about the content
  keywords describing the content, up to 100
  time period
  style
  category
  genre
  MPAA rating
    G, PG, R, X, etc.
  Internet rating
    G, PG, R, X, etc.
  indication if media file contains closed caption information
    can be within media file, within mediaX, or not available
    closed caption list is in mediaX if not in media file
    closed caption format is [text, timecode] pairs in ASCII
  indication if music file contains lyric information
    can be within media file, within mediaX, or not available
    lyrics are included in mediaX if not in media file
    lyric format is [text, timecode] pairs in ASCII
  indication if media file contains cue-point information
    can be within media file, within mediaX, or not available
    cue-points are included in mediaX if not in media file
    cue-point format is [text, timecode] pairs in ASCII
    cue points are text strings describing a time period in a media file
  indication if media file contains embedded URL information
    can be within media file, within mediaX, or not available
    embedded URLs are included in mediaX if not in media file
    embedded URL format is [text, timecode] pairs in ASCII
    embedded URL format is [text, XYZ] pairs in ASCII
    embedded URLs are text strings describing a time period in a media file
  media user data
    arbitrary number of fields of arbitrary length

6. Media Content Attributes
  indication that content is music
    indication of single value or list of time periods with associated scalars
    if single, then scalar percentage from 0 to 100, 100 is max
    if list, then list of [time, scalar]
  indication that content is speech/sound effects
    indication of single value or list of time periods with associated scalars
    if single, scalar percentage from 0 to 100, 100 is max
    if list, then list of [time, scalar]
  indication of volume level
    indication of single value or list of time periods with associated scalars
    if single, scalar value from 0 to 100, 100 is max
    if list, then list of [time, scalar]
  degree of motion
    indication of single value or list of time periods with associated scalars
    if single, scalar value from 0 to 100, 100 is max
    if list, then list of [time, scalar]
  degree of brightness
    indication of single value or list of time periods with associated scalars
    if single, scalar value from 0 to 100, 100 is max
    if list, then list of [time, scalar]
  degree of contrast
    indication of single value or list of time periods with associated scalars
    if single, scalar value from 0 to 100, 100 is max
    if list, then list of [time, scalar]
  degree of chrominance
    indication of single value or list of time periods with associated scalars
    if single, scalar value from 0 to 100, 100 is max
    note that 0 is black and white
    if list, then list of [time, scalar]
  average RGB color
    indication of single value or list of time periods with associated scalars
    if single, scalar value from 0 to 100, 100 is max
    one scalar value per channel of color
    if list, then list of [time, scalar]

7. Media Location
  multimedia file reference (i.e., ftp:name.movie or http://name.movie)

8. Media Previews
  key frame index
  compressed frame index
  preview image for digital video clips [optional]
    location of preview image as offset from file start
    target image resolution for preview image
      width and height
  filmstrip for digital video clips [optional]
    location of filmstrip images as offset from file start
    target image resolution for filmstrip
      width and height
  motion preview for digital video clips [optional]
    location of start and end frames from file start
    target image resolution for motion preview
      width and height

9. Media Commerce
  indication if multimedia data is for sale electronically
  if for sale, which payment mechanism
    Cybercash, First Virtual, Digicash, etc.
  if for sale, the price of the content can be stored in the index
    price in dollars and cents
  if for sale, indicate usage model
    pay to download full version
    pay to license the content for re-use at a web site

10. Media Editing and Creation
  date of multimedia file creation
    "TIME DAY MONTH YEAR", in ASCII, such as "00:00:00 PDT XX Y ZZZZ"
  date of multimedia file encoding
    "TIME DAY MONTH YEAR", in ASCII, such as "00:00:00 PDT XX Y ZZZZ"
  date of last modification to multimedia file
    "TIME DAY MONTH YEAR", in ASCII, such as "00:00:00 PDT XX Y ZZZZ"
  edit history

11. Media Playback
  name of software tools required to playback/display the content
    i.e., for QuickTime use "MoviePlayer", etc.
  URL to obtain the multimedia file player

12. Media Size, Time and Speed
  data size (in bytes) of the multimedia file
  data rate (in bytes/sec) of the multimedia file required for playback
  data rate (in bytes/sec) required for playback for each track
  duration of material (hours, minutes, seconds, tenths of seconds)
    00:00:00:00
  download time for different speed connections

13. Media Contact Information
  domain name where the multimedia file resides
  name of the web page where the multimedia file resides
  e-mail address for the web site where the multimedia file resides
  address and phone number associated with multimedia file, if available
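The [text, timecode] pair format that the appendix specifies for closed captions, lyrics and cue points lends itself to the keyword database described under Timecode Based Display. A minimal sketch follows; the function name, the use of seconds for the timecode, the whitespace-and-punctuation tokenization and the example URL are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[str, str, float]  # (media URL, text, timecode in seconds)


def build_keyword_index(tracks: Dict[str, List[Tuple[str, float]]]) -> Dict[str, List[Entry]]:
    """Map each lower-cased keyword to the media file, text and timecode
    at which it occurs, given [text, timecode] pairs per media URL."""
    index: Dict[str, List[Entry]] = defaultdict(list)
    for url, pairs in tracks.items():
        for text, timecode in pairs:
            words = {w.strip('.,!?;:"') for w in text.lower().split()}
            for word in words:
                if word:
                    index[word].append((url, text, timecode))
    return dict(index)


captions = {
    "http://example.com/speech.mov": [
        ("I have a dream", 12.0),
        ("Free at last!", 95.5),
    ]
}
index = build_keyword_index(captions)
```

A search for "dream" would then return the media URL together with the timecode 12.0, which is the position a viewer would jump to in the search results.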
