## Sunju.org

Randomized Locality Sensitive Vocabularies for
Bag-of-Features Model?
Yadong Mu1, Ju Sun1,2, Tony X. Han3, Loong-Fah Cheong1,2, Shuicheng Yan1
1Electrical and Computer Engineering, National University of Singapore, Singapore
2Interactive & Digital Media Institute, National University of Singapore, Singapore
3Electrical and Computer Engineering, University of Missouri-Columbia, USA
Email: {elemy, idmsj, eleclf, eleyans}@nus.edu.sg,

[email protected]
Abstract. Visual vocabulary construction is an integral part of the pop-ular Bag-of-Features (BOF) model. When visual data scale up (in termsof the dimensionality of features or/and the number of samples), mostexisting algorithms (e.g. k-means) become unfavorable due to the pro-hibitive time and space requirements. In this paper we propose the ran-dom locality sensitive vocabulary (RLSV) scheme towards efficient visualvocabulary construction in such scenarios. Integrating ideas from theLocality Sensitive Hashing (LSH) and the Random Forest (RF), RLSVgenerates and aggregates multiple visual vocabularies based on randomprojections, without taking clustering or training efforts. This simplescheme demonstrates superior time and space efficiency over prior meth-ods, in both theory and practice, while often achieving comparable oreven better performances. Besides, extensions to supervised and kernel-ized vocabulary constructions are also discussed and experimented with.

The bag-of-features (BOF) model (also known as the bag-of-words) has gainedmuch empirical success in producing orderless representations of feature-rich vi-sion data. In this model, in order to obtain uniform representations for featuresets of varying cardinalities, one performs feature quantization for each primi-tive feature referring to a learnt "visual vocabulary", and then summarizes eachset into histograms. Despite its simplicity, the BOF model has shaped the cur-rent paradigms towards many high-level vision problems, e.g., object recognitionand content-based image retrieval (CBIR). In fact, state-of-the-art approachesto object recognition are to first extract local descriptors such as SIFT frominterest points (i.e., local extrema in scale space pyramid) in images, and thendevise Mercer kernels such as Pyramid Match Kernels (PMK) or HistogramIntersectional Kernels (HIK) between pairwise feature histograms. Finally, so-phisticated classifiers such as the Supporting Vector Machines (SVM) are usedfor per-sample decision.

? Support of IDMPO Grant R-705-000-018-279 Singapore and NRF/IDM Program
under research Grant NRF2008IDMIDM004-029 are gratefully acknowledged.

Authors Suppressed Due to Excessive Length
Visual vocabulary construction is critical to the BOF model. In ideal cases,
every visual word in the vocabulary bears concrete meaning, or semantic, in-spired from the similar idea of bag-of-words models in text analysis. In practice,however, quality of the vocabulary depends on numerous factors associated withthe vocabulary creation process, such as the source of samples, the number ofvisual words specified, and the similarity metric. Hence, a practical criterion fora proper visual vocabulary could be visual features near to the same visual wordbear some similarities. This in essence turns the vocabulary construction prob-lem into partitioning the visual feature space according to a few data samples.

Numerous methods (as described in have been proposed to address thispartition problem.

Complications associated with vision data, however, stem from the typical
scale issue including huge amount and high dimensionality. For example, thepopular SIFT descriptor used for local visual feature description has 128 di-mensions, while a typical image at normal resolution can produce 1k ∼ 2k suchprimitive feature descriptors. The complexity quickly explodes for real-world vi-sion databases that usually consist of millions or even billions of such images.

Moreover, peculiarities with high dimensionality, such as concentration of dis-tribution show up frequently and deserve special investigations. Questinginto large-scale problems, most techniques that work nicely in low-dimensionalspaces with small amount of data may not healthily scale up. This calls for novelsolutions that are efficient and dedicate for large-scale data while producing ac-ceptable levels of performance.

There are intimate connections between the vocabulary construction and theclustering/space partitioning problem, the latter of which is widely studied inmachine learning. Hence various unsupervised clustering methods have been ap-plied to this particular problem, among which k-means is the mostpopular. Other methods include mean-shift tree-based coding (e.g., tree in-duced by hierarchical k-means) On the other hand, for problems where super-vision or prior knowledge are available, e.g., labels or segmentations for images,sparsity in representation, supervision is applied to partially guide the unsuper-vised clustering (e.g., random forest based methods such as ERC-Forest or to learn informative vocabularies directly Nevertheless, most ofthe supervised vocabulary learning techniques are very expensive and unlikelyto scale up nicely. Hence we will focus on unsupervised clustering techniques(e.g. k-means) and extremely efficient and flexible supervised techniques (e.g.

the random forest family).

Sparse coding and its various applications are perhaps the most inviting
work into high-dimensional spaces for vision research. In fact for clustering reveals that in high-dimensional spaces, clustering methods such as k-means tend
1 Hereafter we default k-means to the hierarchical k-means algorithm due to its effi-
ciency until otherwise stated.

Randomized Locality Sensitive Vocabularies for Bag-of-Features Model
to return many similar centers to a query, which makes these methods ratherunstable. This is partially explained by the concentration of finite number ofsamples as reported by many authors, e.g. In this case, aggregation ofdistinct clustering results prove useful to enhancing the stability. This is theunderlying principle for the random forest method which also inspires ourmethod.

Randomized algorithms have been widely studied and analyzed in algorithm
designs . For large-scale problems, has tried to contrast the random pro-jection with the classic PCA technique. Moreover random vectors sampled fromthe unit sphere have served as the hashing vectors for one family of LSH In this vein, there is fruitful research work on LSH theory and appli-cations (e.g., image database indexing and searcOur method build fromthe idea of LSH.

We propose and empirically validate the idea of randomized locality sensitivevocabulary (RLSV), which is inspired by both RF and LSH. Instead of build-ing visual vocabularies by optimizing a global objective, our proposed methodgenerates visual words by applying a sequence of random linear bipartitions.

Assume all samples are originally embedded in a specific metric space (e.g.,Hamming space, p-normed space). For any sample pair, theoretic analysis ofLSH guarantees that their collision probability in the resulting vocabulary istightly related to the pairwise similarity or distance. Furthermore, to reduce theinherent randomness, multiple random vocabularies are independently createdusing the same method, and the final inter-sample distance or similarity is basedon the consensus of all vocabularies.

Compared with existing methods, the proposed RLSV has the following mer-
its: 1) No time-consuming training or clustering stage in RLSV. The major com-putational cost comes from the random hash function generation and histogrambinning operations, which can be efficiently handled. 2) Noise resistance andstability by exploiting the ensemble technique. RLSV generates an ensembles ofseveral vocabularies to mitigate the effect of randomness and noise. 3) As statedabove, nearest-prototype based methods (e.g., k-means) tend to suffer from theso-called curse of dimensionality problem. In contrast, the performance of RLSVis stable for high-dimensional data. 4) Compared to methods such as RF whichrequire supervision, RLSV is unsupervised in nature but can be readily extendedto supervised and kernerlized cases.

Randomized Locality Sensitive Vocabularies
In this section and the next we elaborate on the proposed vocabulary construc-tion algorithm, and discuss its relationship to existing methods. We providetheoretic analysis on the time and space complexity in contrast with k-meansand RF, and also present extensions to the supervised or kernelized cases.

Authors Suppressed Due to Excessive Length
A key ingredient to many visual recognition and retrieval applications is tosearch k-nearest-neighbors (k-NN) for the query (or testing sample). In manycases the performance of the whole system heavily hinges on the efficiency ofthe k-NN procedure. In practice, exact k-NN is somewhat unnecessary, sincein many scenarios approximate nearest neighbors (ANN) results in nearly thesame performance. Several efficient algorithms are known for low-dimensionalcases (e.g., up to 10 to 20), such as the kd-tree algorithm. However, forhigh-dimensional cases, these methods often provide little improvement overa linear scan algorithm, which is known to be the phenomenon "the curse ofdimensionality". In recent years, various locality sensitive hashing (LSH) methods are proposed to tackle this problem.

Let H be an LSH family defined on metric space
following relationship
∀h ∈ H, P r[h(x) = h(y)] = κ(x, y),
where κ(·, ·) denotes the similarity measure between samples x and y. In otherwords, x and y's collision probability (i.e., being mapped to the same hashbucket) is monotonically increasing with their similarity value, which is knownas the "locality sensitive" property. Several LSH families have been developedfor various distances (or similarities). Here we list some representative work:Arccos distance: for real-valued feature vectors lying on hypersphere Sd−1 ={x ∈
kxk2 = 1}, an angle-oriented distance can be defined as Θ(x, y) =
. Charikar et al. proposes the following LSH family:
0, if ρ>x < 0
where the hashing vector ρ is uniformly sampled from the unit hypersphere Sd−1.

The collision probability is P r[h(x) = h(y)] = 1 − Θ(x, y)/π.

p distance: for linear vector spaces equipped with the p metric, i.e., D (x, y) =
, Datar et al. proposes a hashing algorithm based on lin-
ear projections onto a 1-dimensional line and chopping the line into equal-lengthsegments, as below:
where the hashing vector ρ ∈
is randomly sampled from the p-stable distri-
bution and b·c is the flooring function for rounding. W is the data-dependentwindow size and b is sampled from the uniform distribution U [0, W ).

2 Note that other definitions of LSH exist, such as the one in . However, they are
fundamentally equivalent to the current.

Randomized Locality Sensitive Vocabularies for Bag-of-Features Model
We employ the arccos distance family in the current work, and the locality
sensitivity property will be key to ensuring that our hashing vectors properlypartition the feature space such that near neighbors are grouped together withhigh probability.

K-means is probably the most popular due to its empirical success. The goalof k-means is to seek K prototypes (or cluster centers) that minimizes a pre-specified functional value. These prototypes constitute a Voronoi diagram per seand each sample is assigned to its nearest prototype according to some specificdistance metric. Despite its popularity, for an input feature set of size n, theclassic k-means requires O(Knd) operations per iteration and typically coststens of iterations before convergence, which is computationally forbidden formassive data source, high dimensional feature spaces and large vocabulary sizes.

Tree structured vocabulary requires shorter training time compared withk-means, yet consuming exponentially increasing memory space (w.r.t. the treedepth) to store the splitting information (random dimension, threshold etc.) ofinner tree nodes.

Compared with these aforementioned structures, the proposed RLSV method
is superior in terms of both memory requirement and training time. The majorweakness is the inferior discriminant ability of single vocabulary resulting fromthe intrinsic randomness in visual word generation. To mitigate it, a straight-forward solution is to collect an ensemble of independent random vocabularies,similar to the idea in ERC-Forest
RLSV Construction
The algorithmic pipeline for RLSV can be described in three consequent stepsas follows:
Step-1: visual word generation Assume the similarity between any two sam-ples p, q ∈
can be measured by κ(p, q). Previous studies (see for a brief
survey) reveal the existence of LSH families for many well-known metrics or sim-ilarities such as p. Formally, let H be an LSH family such that for any hash func-tion h ∈ H :
R → {0, 1}, the locality-sensitive property holds. Suppose B hash
functions are independently generated from H, obtaining H = hh1, h2, . . , hBiafter direct concatenation. We proceed to give a formal definition for "randomvisual word" as below:
Definition 1. (random visual word): there is a bi-mapping between any vi-sual word w
i and valid permutation πi from {0, 1}B . Any two samples p, q ∈ R
belong to the same visual word if for any i ∈ {1, . . , B}, there is hi(p)⊕hj(q) = 0,where ⊕ denotes the XOR bit operation.

In ideal case, B hash functions are able to produce at most 2B unique vi-
sual words. However, in practice the evolutionary curve of visual word count
Authors Suppressed Due to Excessive Length
seldom demonstrates such an exponentially growing tendency. The phenomenacan be geometrically understood, since it is almost impossible for the hyper-plane induced by a hash function intersects with all other polyhedrons producedby other hash functions. Since the relationship between B and the number ofvalid vocabulary entries cannot be accurately determined, practically we main-tain the record of vocabulary sizes and continue adding new hash functions untila pre-defined vocabulary size M is reached.

Step-2: visual word filtering In Fig. we plot the sample counts correspond-ing to distinct visual words. It can be seen that it roughly follows a power-law
Shannon Entropy (normalized to 0 1) 0.1
Indices of Visual Words
Indices of Visual Words
Fig. 1. Illustration of the typical distribution of features (left, sorted descend-ingly) and Shannon entropy of labels (right, sorted descendingly) for a 1024-word visual vocabulary in a multi-class problem. Empty words are omitted. Non-informative bins of the vocabulary can be filtered out accordingly by throwingaway low-frequency bins.

distribution and thus part of the vocabulary can be trimmed without notableinformation loss. Moreover, in the supervised (or semi-supervised) settings (dis-cussed in we can also abandon the visual words that have weak discrim-inating power. For multi-class problems, useful statistics can be calculated basedon the entropy of class-label distribution for each valid visual word. In Fig. weplot the entropy distribution on the data set of Caltech-101 (vocabulary size M= 1000). In the above two cases, a simple threshold-based visual word filteringwill benefit outlier removal and yield more compact representations.

Step-3: histogram binning and normalization In practice, we maintainL independent random vocabularies to ensure the stability and enhance theperformance. After vocabulary construction, each feature bag can be transformedinto uniform histogram representations by casting its elements into visual wordsand then perform counting and normalization. For each feature, the binary hashbits H = hh1, h2, . . , hBi determine a unique decimal value in [0, 2B]. Recallthat there are actually no valid visual words corresponding to most decimalvalues, we maintain a word-key mapping table T : {0, 1}B → {1, . . , M } for thepurpose of efficient binning.

Randomized Locality Sensitive Vocabularies for Bag-of-Features Model
Complexity Analysis
Here we provide theoretic comparisons amongst RLSV, ERC-Forest (ERCF),and Hierarchical k-means (HK), in terms of the vocabulary construction timecomplexity, storage requirements, and the query complexity (i.e., a new vectorgets assigned to one of the bins in the vocabulary). Table presents these resultsin Big-O notation. Here we assume all tree-structures are with splitting factorof 2, and D for feature dimension, N for total number of available samples, Knumber of desired cluster centers. For simplicity, we further assume K = 2d,where d + 1 will be the tree depth for binary splitting trees as we have assumed.

Note that c is an undetermined constant in (1, 2), accounting for the empty
Table 1. Comparison of time and space complexity of different methods.

Algorithm Construction Time Space Complexity per Word Query Complexity
buckets that have been generated and trimmed after random projections. Forthe time complexity, RLSV is independent of N , so it can scale up nicely evenwhen the data set is huge. For the storage requirement, KLSH approaches 0 forspace per word as K goes up, whereas the other two methods remain constantfor large K. It is unfortunate that the query complexity of RLSV is not as lowas the ERCF, which could be hurt for very high-dimensional data.

The algorithm presented in subsection targets unsupervised cases in finite-dimensional linear feature spaces. However, both kernel tricks and supervisioninformation are ubiquitous in computer vision databases. For the former, thepairwise similarity is gauged via the inner product in reproducing kernel Hilbertspace (RKHS) While for the latter, supervision information from manuallabeling or annotations is available to regularize the constructed visual words.

Both of them are common scenarios in real-world applications. In this sectionwe discuss the extensions to these cases.

Kernelized RLSV (K-RLSV)
Note that choice of an LSH family in an application depends on the underly-ing metric in the feature space. Here we focus on p distance when p = 2 andArccos distance. Recall that in both cases LSH is feasible based on samplingfrom standard Gaussian distribution which belongs to the p-stable distribution.

Generally, sampling in RKHS is difficult owing to the lack of explicit featurerepresentation. However, we have the following observation (similar to the the-oretic results in yet no zero-centered assumption on the Gram matrices
Authors Suppressed Due to Excessive Length
here), which reveals that sampling from Gaussian distribution in Hilbert spaceis feasible:
Theorem 1. ( 2-keeping projection in RKHS) Denote κ(·, ·) as the innerproduct in Hilbert space K. Given an m-cardinality data set X and correspondingGram matrix G, the 2-metric keeping projection can be expressed as p(x) =Pm
i), where ω(i) only relies on G.

Proof. Denote the implicit Hilbert mapping function as ψ. The geometric meancan be computed as µ
i). For a t-cardinality subset S ⊂ {1 . . m},
ψ ). According to the central limit theorem,
z is distributed as Gaussian Φ(0, Σ), where Σ is the covariance matrix of X .

Further applying a whitening transform, we can obtain the desired hash vectorin K, i.e. r = Σ1/2z. For any datum x, h(x) = ψ(x)>Σ1/2z.

Given Gram matrix G = Ψ >Ψ , where each column of Ψ corresponds to a
feature vector in data set X . It is easily verified that
z>Σ1/2ψ(x) = z>(Ψ Q)(QGQ)− 12 (Ψ Q)>ψ(x)
where Q = I − 1 ee>. Substituting z =
indicator vector for subset S. Finally we get
e)GQ(QGQ)− 12 Q>iΨ >ψ(x)
e)GQ(QGQ)− 12 Q> , thus the conclusion holds.

The complexity of the above procedure is low since m n where n is the sam-
ple count of the whole database (e.g., m = 200 ∼ 1000 and n is probably on theorder of million or billion). From the property of p-stable distribution, for samplesx, y, projection difference p(x) − p(y) = Pm
sustains their distance in the implicit RKHS induced by κ(·, ·), which makes theLSH algorithms mentioned in Section feasible.

Discriminative RLSV (D-RLSV)
Denote the ensemble of all feature bags as B = {bi}, where each bag bi ={xi , . . , x }. In the supervised case, a unique label y
i is assigned to bag bi. For
tractability, we adopt the same assumption to i.e., assuming all features ina bag share the same label. Recall that the scheme in subsection is totallyrandom. A possible improvement is to sequentially select an optimal hashingvectors from a candidate pool according to pre-specified label-oriented crite-rion, which motivates the Discriminative RLSV (D-RLSV) here. The proposedmethod works as follows: suppose k hashing vectors have been generated anddenote the resulting visual words as Vk = {wi, i = 1 . . nk} (nk is the vocabulary
Randomized Locality Sensitive Vocabularies for Bag-of-Features Model
size). To choose the (k + 1)-th hashing vector, for each candidate e
a score based on the Shannon entropy as suggested in defined as:
h) denotes the entropy of the class distribution in the i-th visual
word. Note that adopting e
h will split each of the existing visual words into two.

h) describes the split entropy of the i-th visual word. Formally,
The maximum of HT (
h) is reached when the two partitions have equal size.

Based on the entropy of a given visual word, the impurity can be calculated bythe mutual information of the split, i.e.,
h) = HC (ωi) −
where ω+, ω− are the split of ω
h, and n+, n− denote the number of features
belonging to each new visual word respectively. Finally, the optimal hashingvector for the (k + 1)-th iteration can be determined via h∗ = arg max Sk+1(e
In this section, we evaluate the proposed RLSV and its extensions on real-worlddata sets under three different task settings, i.e., action recognition in video,object recognition, and near-duplicate video detection. Our main concerns in-clude: 1) time used to construct visual vocabularies. 2) memory storage used tokeep vocabulary-related information. 3) performance in terms of accuracy. Allthe experiments are conducted on our common PC with Due-core 3.0Ghz CPUand 8GB physical memory. We choose two representative methods, i.e., Hierar-chical K-means and ERC-Forest for comparison. For the former, we adopta tree branching factor of 2. All statistics are obtained by averaging multipleindependent runs.

Experiment-1: KTH Video Database
The KTH video database was developed by Schuldt et al. in 2004 and is oneof the popular benchmarks for human action recognition. It contains six differ-ent actions captured with appearance variations and mild camera motions suchas zooming in and zooming out. The actions are performed by 25 subjects in 4different scenarios. Each video clips are segmented into 4 sub-clips, resulting in2400 video sequences in total. See Fig. for the illustration of KTH video clips.

Authors Suppressed Due to Excessive Length
Recognition Accuracy (0 1)
K−RLSV with Chi−Square Kernel
Storage Cost for Every Visual Word (in byte)
Codebook Size (in Log scale)
Number of Hashing Vectors (or the depth of the tree)
Fig. 2. Top: Example frames from KTH video database. Bottom Left: Av-eraged recognition rates. Bottom Right: Storage cost for each visual word.

RLSV can consistently achieve comparable performance (with little deviation)with ERC, while consuming less memory. By comparison, k-means is worse inboth performance and space efficiency. Please refer to the color pdf for betterview.

We use the same dataset splitting as in which contains a training set (8persons), a validation set (8 persons) and a test set (9 persons). For local fea-ture descriptors, we describe each video segment using Space-time interest points(STIP) as in around which histogram of gradient (HOG) and histogram offlow (HOF) features are extracted. Both of the counts of independent vocabu-laries in RLSV or trees in ERC-Forest are set to be 50. Recognition accuraciesare presented on the bottom left of Fig. RLSV algorithm family demonstratescomparable discriminating ability to ERC-Forest, but both are obviously betterthan K-means. D-RLSV reaches a peek performance of 91.4% with around 4000visual words, which is in sync with the setting and results reported in Inthe bottom right of Fig. we illustrate the averaged memory storage for thevocabulary-related information per visual word, which well validates our previ-ous complexity analysis.

Caltech-101 is constructed to test object recognition algorithms for semantic cat-egories of images. The data set contains 101 object categories and 1 backgroundcategory, with 40 to 800 images per category. As pre-processing, the maximumdimension of each image is normalized to be 480-pixel. Most objects containedin the images are of similar scale and orientation. However, considering the largeinter-category variation on appearance, lighting and occlusion, the recognition
Randomized Locality Sensitive Vocabularies for Bag-of-Features Model
task on Caltech-101 is still challenging and well suitable to testing various localfeatures and visual vocabularies.

For fair comparison between different vocabularies, we guarantee that they
share the same type of features, and any post-processing on them. Specifically,we extract 3000 SIFT descriptors from each image. Unlike the traditional densesampling strategy on a uniform 2-D image grid, we determine both the locationsand scales of each SIFT feature in a random way ignoring more complexand effective sampling schemes such as However, the experimental resultsfor such a simple scheme are amazingly good, as shown in Fig.
Recognition Accuracy (0 1)
Number of Visual Words (in Log scale)
Codebook Size (in Log scale)
Fig. 3. Left: Performance with 15 training samples per category. Our RLSVfamily shows consistently (with little deviation) comparable or even better per-formance than the other methods. Right: Vocabulary sizes under varying num-ber of hashing vectors in RLSV or tree depth in ERC-Forest. Please refer to thecolor pdf for better view.

We test the proposed method in two settings, either with 15 or 30 training
samples per category. Here we only present the former due to the space con-straint and also the observation that the latter case concurs with the former interms of algorithmic behaviors as compared to other methods. The peak perfor-mance (41.4% for 15 training samples per class) appears with roughly 700 visualwords. The performance drastically decreases with extremely smaller or largervocabularies, which is consistent with previous study in computer vision and ourexperimental settings. For parameter setting, we use 20 independent visual vo-cabularies for RLSV and its extensions, and 20 trees for the ERC-Forest. In theclassification stage, we regress the histogram feature of the testing sample on thecolumn space spanned by all the training samples in each category, and measurethe distance with the regression residue. The overall distance is computed as thesummation over individual visual vocabularies or random trees. Classificationmethods like SVM or NN produce similar yet slightly worse results. As seen inFig. the accuracies produced by RLSV-related methods and ERC-Forest arecomparable, and all are superior to hierarchial K-means. Moreover, it is alsoobserved that the vocabulary sizes roughly linearly increase with respect to thenumber of hashing vectors in RLSV and tree depth in ERC-Forest or K-means,
Authors Suppressed Due to Excessive Length
although the increasing rate of RLSV-related methods are much smaller thanothers.

An interesting comparison is between k-means and random projection (into
a 60-dimensional lower space) followed by k-means (RP-kmeans). The randomprojection is meant to be a lightweight replacement of the PCA for dimensional-ity reduction as suggested in And the simple scheme consistently outperformsthe simple k-means for all vocabulary sizes. This can be partially explained bythe property of reduced eccentricity exhibited by the random projection dis-cussed in Nevertheless, there are still significant performance gaps betweenRP-kmeans and RLSV or ERC-Forest methods. Moreover, there is no funda-mental change in the time and space complexity.

D−RLSV (n=459K)
ERC−Forest (n=459K)K−means (n=459K)
RLSV with 4 hash vectors
K−RLSV (n=459K)
RLSV with 6 hash vectors
RLSV with 8 hash vectors
D−RLSV (n=153K)
RLSV with 10 hash vectors
ERC−Forest (n=153K)
RLSV with 13 hash vectors
K−means (n=153K)
Training Time for Visual Vocabulary (in seconds)
Number of Hash Table
Number of Hash Vectors (or Tree Depth)
Fig. 4. Left: Performance evolution curve with respect to the number of hashtables in RLSV. Right: Time spent to construct visual vocabularies for variousapproaches.

A random vocabulary as in RLSV or ERC-Forest is very easy to generate,
but yet a single one typically performs inferiorly even as compared with k-meansin the classification phase. However, we argue that the ensemble of independent"weak" vocabularies significantly outperforms a single elaborately-designed vo-cabulary, in the same spirit to the bagging algorithm developed in machinelearning. We plot the performance evolution curve w.r.t. the number of hashingvectors for RLSV on the left of Fig. As can been seen, the performance ofRLSV ensemble hikes rapidly and overruns k-means with only a small numberof weak vocabularies.

Fig. (right) also provides an illustration of the time cost for vocabulary con-
struction consumed by various methods. Time spent on vocabulary constructionroughly follows a log-linear rule. Among them, RLSV and K-RLSV are order-of-magnitude faster as compared to others.

Experiment-3: Near-Duplicate Video Detection
We also validate the proposed RLSV method for detecting near-duplicate videoclips. Being intrinsically unsupervised, ERC-Forest and D-RLSV are not suit-able. For such applications, RLSV beats K-means owing to its fast speed and

Randomized Locality Sensitive Vocabularies for Bag-of-Features Model
high flexibility. The near duplicate video benchmark is provided by Wu Thebenchmark comprises 12876 video clips, divided into 24 sets. The groundtruthof each video set is manually labeled, and the most relevant video is treated asthe query video. The task is to distinguish near-duplicate videos to the queryamong each video set.

Fig. 5. Selected video clips from Wu's near-duplicate detection database.

Table 2. Comparison of different approaches on near-duplicate video detection.

Mean Average Precision (MAP)
Time for vocabulary construction (in second)
Time for code generation (in second)
1.85 × 10−2 3.4 × 10−3 2.09 × 10−2
We adopt a key frame based approach. Each video is first segmented and one
key frame is taken from a segment, resulting in 30 key frames per video clip onaverage. We extract a simple HSV color histogram from each key frame. The 24dimensional HSV color histogram is concatenated with 18 bins for Hue, 3 binsfor Saturation and 3 bins for Value as described in As seen in Figure the HSV histogram can greatly vary among different key frames, thus the bag-of-feature model well fits this scenario. We test the Mean Average Precision(MAP) over all 24 queries, and together provide a comparison between timespent on vocabulary construction and code generation of each video clip. A K-RLSV method is also included in the experiment, where the Chi-Square kernelis applied. Table indicates RLSV is a good tradeoff between MAP and speed.

We have presented a simple yet effective algorithm for visual vocabulary con-struction combining the idea of LSH and RF. It avoids the severe problems oc-curring in most existing methods, such as the slow training (e.g., in K-means),or huge storage (e.g., in ERC-Forest). The proposed method strikes a good bal-ance between accuracy and efficacy, and is supposed to be applicable to manyreal-world applications in computer vision. Moreover, extensions to kernerlizedand supervised cases are also presented. We plan to extend the work to on-line settings. Moreover, currently our method is not advantageous in terms ofquery complexity. This motivates us to investigate the possibility of synergizingconstruction and query process towards higher efficiency.

Authors Suppressed Due to Excessive Length
1. Lowe, D.: Distinctive image features from scale-invariant keypoints. International
journal of computer vision 60 (2004) 91–110
2. Beyer, K.S., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is "nearest neigh-
bor" meaningful? In: ICDT. (1999) 217–235
3. Dasgupta, S.: Experiments with random projection. In: UAI. (2000) 143–1514. Jurie, F., Triggs, B.: Creating efficient codebooks for visual recognition. In: ICCV.

enius, H.: Scalable recognition with a vocabulary tree. In: CVPR
(2). (2006) 2161–2168
6. Breiman, L.: Random forests. Machine Learning 45 (2001) 5–327. Moosmann, F., Nowak, E., Jurie, F.: Randomized clustering forests for image
classification. IEEE Trans. Pattern Anal. Mach. Intell. 30 (2008) 1632–1646
8. Yang, L., Jin, R., Sukthankar, R., Jurie, F.: Unifying discriminative visual code-
book generation with classifier training for object category recognition. In: CVPR.

(2008)
9. Lee, H., Battle, A., Raina, R., Ng, A.: Efficient sparse coding algorithms. NIPS
10. Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust face recognition via
sparse representation. PAMI (2009) 210–227
11. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to algorithms. The
12. Charikar, M.:
Similarity estimation techniques from rounding algorithms.

STOC. (2002) 380–388
13. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest
neighbor in high dimensions. Commun. ACM 51 (2008) 117–122
14. Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image
search. In: ICCV. (2009)
15. Bentley, J.L.: Multidimensional binary search trees used for associative searching.

Commun. ACM 18 (1975) 509–517
16. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.:
scheme based on p-stable distributions. In: SCG. (2004) 253–262
17. Indyk, P., Motwani, R.: Approximate nearest neighbors: Towards removing the
curse of dimensionality. In: STOC. (1998) 604–613
18. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest
neighbor in high dimensions. Commun. ACM 51 (2008) 117–122
19. Scholkopf, B., Smola, A.J.:
Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press (2001)
20. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Machine Learn-
ing 63 (2006) 3–42
21. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM
approach. In: ICPR. (2004)
22. Laptev, I., Marsza lek, M., Schmid, C., Rozenfeld, B.: Learning realistic human
actions from movies. (2008)
23. Nowak, E., Jurie, F., Triggs, B.: Sampling strategies for bag-of-features image
classification. In: ECCV (4). (2006) 490–503
ee, R., Geurts, P., Piater, J.H., Wehenkel, L.: Random subwindows for robust
image classification. In: CVPR (1). (2005) 34–40
25. Wu, X., Hauptmann, A.G., Ngo, C.W.: Practical elimination of near-duplicates
from web video search. In: ACM Multimedia. (2007) 218–227

Source: http://sunju.org/docs/eccv10_rv.pdf

Di Eybike Mame The Eternal MotherWomen in Yiddish Theater and Popular Song 1905–1929 ausführliche Textversionextended text version SM 16252 (CD WERGO) Di Eybike Mame (The Eternal Mother):Women in Yiddish Theater and Popular Song, 1905–1929 The recordings presented on this anthology represent a cross-section of women's contribution to theYiddish-language popular song culture which began to develop in the urban centers of Eastern Europein the mid-19th century. The first expression of this emerging culture were the broder-zinger, soloperformers and troupes of singer-songwriters who performed songs and skits in the secular sur-roundings of the inns, wine cellars and restaurant gardens of Jewish centers in Austro-Hungary,Romania and Russia. The broder-zinger were generally maskilim, followers of the haskalah, the

Impact of Agricultural and Waste Water Treatment Facility Runoff on the Incidence of Antibiotic Resistant Bacteria in Streams Due to increased usage of antibiotic drugs over the past few decades, researchers are finding increasing proportions of bacteria in the environment that are resistant to antibiotics.Areas that are especially affected include streams that receive runoff from farms utilizingantibiotic drugs in their animal feed and from waste water treatment facilities. The goal ofthis study was to determine if these types of pollution are causing an increase in populationsof antibiotic resistant bacteria in streams. Water was collected from three points along astream receiving runoff from agricultural areas and from points above, at, and below theoutflow pipe of a waste water treatment facility. Water was also collected from a locationgeographically removed from these pollution sources. Bacteria filtered from the watersamples were plated on media selective for the growth of coliforms or media selective forthe growth of Acinetobacter. Colonies picked from these plates were grown on mediacontaining ampicillin, chloramphenicol, norfloxacin, streptomycin, or tetracycline, or noantibiotic. Susceptibility or resistance to antibiotics was determined by comparing thepercentage of colonies that grew on media with and without antibiotic. The number ofcoliform bacteria resistant to ampicillin was significantly higher at the waste water treatmentfacility outflow pipe than upstream of the outflow. Greater numbers of coliforms andAcinetobacter resistant to chloramphenicol, streptomycin, and tetracycline were also foundat and below the outflow compared to upstream. Agricultural runoff seems to contribute toan increase in the number of coliform bacteria resistant to ampicillin, streptomycin, andtetracycline, and to the number of Acinetobacter resistant to tetracycline. These resultsappear to indicate that the use of antibiotics in both agriculture and in humans is increasingthe incidence of antibiotic resistant bacteria in lotic environments.