Lazier Than Lazy Greedy
Baharan Mirzasoleiman
ETH Zurich
Ashwinkumar Badanidiyuru
Google Research Mountain View
Amin Karbasi
Yale University
Jan Vondrák
IBM Almaden
Andreas Krause
ETH Zurich
Abstract
Is it possible to maximize a monotone submodular function
faster than the widely used lazy greedy algorithm (also known
as accelerated greedy), both in theory and practice? In this
paper, we develop the first linear-time algorithm for maxi-
mizing a general monotone submodular function subject to
a cardinality constraint. We show that our randomized algo-
rithm, STOCHASTIC-GREEDY, can achieve a (1 − 1/e − ε)
approximation guarantee, in expectation, to the optimum so-
lution in time linear in the size of the data and independent of
the cardinality constraint. We empirically demonstrate the ef-
fectiveness of our algorithm on submodular functions arising
in data summarization, including training large-scale kernel
methods, exemplar-based clustering, and sensor placement.
We observe that STOCHASTIC-GREEDY practically achieves
the same utility value as lazy greedy but runs much faster.
More surprisingly, we observe that in many practical scenar-
ios STOCHASTIC-GREEDY does not even evaluate every data
point once and still achieves results indistinguishable from
those of lazy greedy.
Introduction
For the last several years, we have witnessed the emergence
of datasets of an unprecedented scale across different scien-
tific disciplines. The large volume of such datasets presents
new computational challenges as the diverse, feature-rich,
unstructured and usually high-resolution data does not allow
for effective data-intensive inference. In this regard, data
summarization is a compelling (and sometimes the only)
approach that aims at both exploiting the richness of large-
scale data and being computationally tractable. Instead of
operating on complex and large data directly, carefully con-
structed summaries not only enable the execution of vari-
ous data analytics tasks but also improve their efficiency and
scalability.
In order to effectively summarize the data, we need to
define a measure for the amount of representativeness that
lies within a selected set. If we think of representative el-
ements as the ones that cover best, or are most informative
w.r.t. the items in a dataset, then naturally adding a new el-
ement to a set of representatives, say A, is more beneficial
than adding it to its superset, say B ⊇ A, as the new element
can potentially enclose more uncovered items when consid-
ered with elements in A rather than B. This intuitive di-
minishing returns property can be systematically formalized
through submodularity (cf. Nemhauser, Wolsey, and Fisher
(1978)). More precisely, a submodular function f : 2^V → ℝ
assigns a subset A ⊆ V a utility value f(A) –measuring the
representativeness of the set A– such that

f(A ∪ {i}) − f(A) ≥ f(B ∪ {i}) − f(B)

for any A ⊆ B ⊆ V and i ∈ V \ B. Note that
∆(i|A) ≐ f(A ∪ {i}) − f(A) measures the marginal gain
of adding a new element i to a summary A. Of course,
the meaning of representativeness (or utility value) depends
very much on the underlying application; for a collection
of random variables, the utility of a subset can be mea-
sured in terms of entropy, and for a collection of vectors, the
utility of a subset can be measured in terms of the dimen-
sion of a subspace spanned by them. In fact, summariza-
tion through submodular functions has gained a lot of inter-
est in recent years, with applications ranging from exemplar-
based clustering (Gomes and Krause 2010), to document
(Lin and Bilmes 2011; Dasgupta, Kumar, and Ravi 2013)
and corpus summarization (Sipos et al. 2012), to recom-
mender systems (Leskovec et al. 2007; El-Arini et al. 2009;
El-Arini and Guestrin 2011).
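To make the diminishing-returns inequality above concrete, the following minimal Python sketch (with a purely illustrative coverage function and hypothetical sets) checks that the marginal gain of a fixed element never increases when the summary grows:

# Coverage utility f(A) = number of distinct items covered by the sets indexed by A.
# The sets below are hypothetical and only serve to illustrate diminishing returns.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}}

def f(A):
    covered = set()
    for i in A:
        covered |= sets[i]
    return len(covered)

def marginal_gain(i, A):
    return f(A | {i}) - f(A)

A, B = {"a"}, {"a", "b"}             # A is a subset of B
print(marginal_gain("c", A))         # 3: gain w.r.t. the smaller summary
print(marginal_gain("c", B))         # 2: never larger w.r.t. the superset B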
Since we would like to choose a summary of a manage-
able size, a natural optimization problem is to find a sum-
mary A* of size at most k that maximizes the utility, i.e.,

A* = argmax_{A : |A| ≤ k} f(A).    (1)
Unfortunately, this optimization problem is NP-hard for
many classes of submodular functions (Nemhauser and
Wolsey 1978; Feige 1998). We say a submodular function is
monotone if for any A ⊆ B ⊆ V we have f(A) ≤ f(B). A
celebrated result of Nemhauser, Wolsey, and Fisher (1978)
–with great importance in artificial intelligence and machine
learning– states that for non-negative monotone submodu-
lar functions a simple greedy algorithm provides a solution
with a (1 − 1/e) approximation guarantee to the optimal (in-
tractable) solution. This greedy algorithm starts with the
empty set A_0 and in iteration i, adds an element maximiz-
ing the marginal gain ∆(e|A_{i−1}). For a ground set V of
size n, this greedy algorithm needs O(n · k) function eval-
uations in order to find a summarization of size k. How-
ever, in many data intensive applications, evaluating f is
expensive and running the standard greedy algorithm is in-
feasible. Fortunately, submodularity can be exploited to im-
plement an accelerated version of the classical greedy algo-
rithm, usually called LAZY-GREEDY (Minoux 1978a). In-
stead of computing ∆(e|A_{i−1}) for each element e ∈ V, the
LAZY-GREEDY algorithm keeps an upper bound ρ(e) (ini-
tially ∞) on the marginal gain sorted in decreasing order. In
each iteration i, the LAZY-GREEDY algorithm evaluates the
element on top of the list, say e, and updates its upper bound,
ρ(e) ← ∆(e|A_{i−1}). If after the update ρ(e) ≥ ρ(e′) for all
e′ ≠ e, submodularity guarantees that e is the element with
the largest marginal gain. Even though the exact cost (i.e.,
number of function evaluations) of LAZY-GREEDY is un-
known, this algorithm leads to orders of magnitude speedups
in practice. As a result, it has been used as the state-of-the-
art implementation in numerous applications including net-
work monitoring (Leskovec et al. 2007), network inference
(Rodriguez, Leskovec, and Krause 2012), document sum-
marization (Lin and Bilmes 2011), and speech data subset
selection (Wei et al. 2013), to name a few. However, as the
size of the data increases, even for small values of k, run-
ning LAZY-GREEDY is infeasible. A natural question to ask
is whether it is possible to further accelerate LAZY-GREEDY
by a procedure with a weaker dependency on k. Or even bet-
ter, is it possible to have an algorithm that does not depend
on k at all and scales linearly with the data size n?
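Before describing our approach, the priority-queue idea behind LAZY-GREEDY sketched above can be made concrete as follows. This is a minimal illustration rather than the implementation used in the experiments; it assumes the oracle f accepts a list of selected elements.

import heapq

def lazy_greedy(f, V, k):
    # Keep stale upper bounds on marginal gains in a max-heap (negated values);
    # only the element on top of the heap is re-evaluated in each step.
    A, f_A = [], f([])
    heap = [(-float("inf"), idx, e) for idx, e in enumerate(V)]
    heapq.heapify(heap)
    while len(A) < k and heap:
        while True:
            _, idx, e = heapq.heappop(heap)
            gain = f(A + [e]) - f_A               # refresh the bound for e
            if not heap or gain >= -heap[0][0]:
                break                             # by submodularity, e is the best element
            heapq.heappush(heap, (-gain, idx, e)) # otherwise reinsert and try the next one
        A.append(e)
        f_A += gain
    return A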
In this paper, we propose the first linear-time algorithm,
STOCHASTIC-GREEDY, for maximizing a non-negative
monotone submodular function subject to a cardinality con-
straint k. We show that STOCHASTIC-GREEDY achieves
a (1 − 1/e − ε) approximation guarantee to the opti-
mum solution with running time O(n log(1/ε)) (measured
in terms of function evaluations) that is independent of
k. Our experimental results on exemplar-based clustering
and active set selection in nonparametric learning also con-
firm that STOCHASTIC-GREEDY consistently outperforms
LAZY-GREEDY by a large margin while achieving practi-
cally the same utility value. More surprisingly, in our exper-
iments we observe that STOCHASTIC-GREEDY sometimes
does not even evaluate all the items and shows a running
time that is less than n while still providing solutions close
to the ones returned by LAZY-GREEDY. Due to its inde-
pendence of k, STOCHASTIC-GREEDY is the first algorithm
that truly scales to voluminous datasets.
Related Work
Submodularity is a property of set functions with deep the-
oretical and practical consequences. For instance, sub-
modular maximization generalizes many well-known com-
binatorial problems including maximum weighted match-
ing, max coverage, and facility location, to name a few. It
has also found numerous applications in machine learning
and artificial intelligence such as influence maximization
(Kempe, Kleinberg, and Tardos 2003), information gath-
ering (Krause and Guestrin 2011), document summariza-
tion (Lin and Bilmes 2011) and active learning (Guillory
and Bilmes 2011; Golovin and Krause 2011). In most of
these applications one needs to handle increasingly larger
quantities of data. For this purpose, accelerated/lazy vari-
ants (Minoux 1978a; Leskovec et al. 2007) of the celebrated
greedy algorithm of Nemhauser, Wolsey, and Fisher (1978)
have been extensively used.
Scaling Up: To solve the optimization problem (1) at
scale, there have been very recent efforts to either make
use of parallel computing methods or treat data in a
streaming fashion. In particular, Chierichetti, Kumar, and
Tomkins (2010) and Blelloch, Peng, and Tangwongsan
(2011) addressed a particular instance of submodular
functions, namely, maximum coverage and provided a
distributed method with a constant factor approximation
to the centralized algorithm. More generally, Kumar et al.
(2013) provided a constant approximation guarantee for
general submodular functions with bounded marginal gains.
Contemporarily, Mirzasoleiman et al. (2013) developed a
two-stage distributed algorithm that guarantees solutions
close to the optimum if the dataset is massive and the
submodular function is smooth.
Similarly, Gomes and Krause (2010) presented a heuristic
streaming algorithm for submodular function maximization
and showed that under strong assumptions about the way
the data stream is generated their method is effective. Very
recently, Badanidiyuru et al. (2014) provided the first one-
pass streaming algorithm with a constant factor approxima-
tion guarantee for general submodular functions without any
assumption about the data stream.
Even though the goal of this paper is quite different
and complementary in nature to the aforementioned work,
STOCHASTIC-GREEDY can be easily integrated into ex-
isting distributed methods. For instance, STOCHASTIC-
GREEDY can replace LAZY-GREEDY for solving each sub-
problem in the approach of Mirzasoleiman et al. (2013).
More generally, any distributed algorithm that uses LAZY-
GREEDY as a sub-routine, can directly benefit from our
method and provide even more efficient large-scale algorith-
mic frameworks.
Our Approach: In this paper, we develop the first central-
ized algorithm whose cost (i.e., number of function evalua-
tions) is independent of the cardinality constraint, which in
turn directly addresses the shortcoming of LAZY-GREEDY.
Perhaps the closest, in spirit, to our efforts are approaches
by Wei, Iyer, and Bilmes (2014) and Badanidiyuru and
Vondrák (2014). Concretely, Wei, Iyer, and Bilmes (2014)
proposed a multistage algorithm, MULTI-GREEDY, that
tries to decrease the running time of LAZY-GREEDY by
approximating the underlying submodular function with a
set of (sub)modular functions that can be potentially eval-
uated less expensively. This approach is effective only
for those submodular functions that can be easily decom-
posed and approximated. Note again that STOCHASTIC-
GREEDY can be used for solving the sub-problems in each
stage of MULTI-GREEDY to develop a faster multistage
method. Badanidiyuru and Vondrák (2014) proposed a dif-
ferent centralized algorithm that achieves a (1 − 1/e − ε)
approximation guarantee for general submodular func-
tions using O((n/ε) log(n/ε)) function evaluations. However,
STOCHASTIC-GREEDY consistently outperforms their algo-
rithm in practice in terms of cost, and returns higher util-
ity value. In addition, STOCHASTIC-GREEDY uses only
O(n log(1/)) function evaluations in theory, and thus pro-
vides a stronger analytical guarantee.
STOCHASTIC-GREEDY Algorithm
In this section, we present our randomized greedy algorithm
STOCHASTIC-GREEDY and then show how to combine it
with lazy evaluations. We will show that STOCHASTIC-
GREEDY has provably linear running time independent of
k, while simultaneously having the same approximation ra-
tio guarantee (in expectation). In the following section we
will further demonstrate through experiments that this is also
reflected in practice, i.e., STOCHASTIC-GREEDY is substan-
tially faster than LAZY-GREEDY, while being practically
identical to it in terms of the utility.
The main idea behind STOCHASTIC-GREEDY is to pro-
duce, in each step, an element which improves the value of
the solution roughly as much as greedy, but in a fast manner. This is
achieved by a sub-sampling step. At a very high level this
is similar to how stochastic gradient descent improves the
running time of gradient descent for convex optimization.
Random Sampling
The key reason that the classic greedy algorithm works is
that at each iteration i, an element is identified that reduces
the gap to the optimal solution by a significant amount, i.e.,
by at least (f(A*) − f(A_{i−1}))/k. This requires n oracle
calls per step, the main bottleneck of the classic greedy al-
gorithm. Our main observation here is that by submodu-
larity, we can achieve the same improvement by adding a
uniformly random element from A* to our current set A. To
get this improvement, we will see that it is enough to ran-
domly sample a set R of size (n/k) log(1/ε), which in turn
overlaps with A* with probability 1 − ε. This is the main
reason we are able to achieve a boost in performance.
The algorithm is formally presented in Algorithm 1. Sim-
ilar to the greedy algorithm, our algorithm starts with an
empty set and adds one element at each iteration. But in
each step it first samples a set R of size (n/k) log(1/ε) uni-
formly at random and then adds the element from R to A
which increases its value the most.
Algorithm 1 STOCHASTIC-GREEDY
Input: f : 2^V → ℝ_+, k ∈ {1, . . . , n}.
Output: A set A ⊆ V satisfying |A| ≤ k.
1: A ← ∅.
2: for (i ← 1; i ≤ k; i ← i + 1) do
3:    R ← a random subset obtained by sampling s random elements from V \ A.
4:    a_i ← argmax_{a∈R} ∆(a|A).
5:    A ← A ∪ {a_i}
6: return A.
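A minimal Python sketch of Algorithm 1 is given below. It is illustrative only: the oracle f is assumed to take a list of selected elements, and sampling is done without repetition, which can only help the guarantee.

import math
import random

def stochastic_greedy(f, V, k, eps=0.1, rng=random):
    # In each of the k rounds, draw s = (n/k) log(1/eps) candidates from V \ A
    # and add the one with the largest marginal gain.
    n = len(V)
    s = max(1, math.ceil((n / k) * math.log(1.0 / eps)))
    A, f_A = [], f([])
    remaining = set(V)
    for _ in range(k):
        if not remaining:
            break
        R = rng.sample(list(remaining), min(s, len(remaining)))
        best, best_gain = None, -float("inf")
        for a in R:
            gain = f(A + [a]) - f_A    # one oracle call per sampled candidate
            if gain > best_gain:
                best, best_gain = a, gain
        A.append(best)
        f_A += best_gain
        remaining.remove(best)
    return A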
Our main theoretical result is the following. It shows that
STOCHASTIC-GREEDY achieves a near-optimal solution for
general monotone submodular functions, with computa-
tional complexity independent of the cardinality constraint.
Theorem 1. Let f be a non-negative monotone submodular
function. Let us also set s = (n/k) log(1/ε). Then STOCHASTIC-
GREEDY achieves a (1 − 1/e − ε) approximation guarantee
in expectation to the optimum solution of problem (1) with
only O(n log(1/ε)) function evaluations.
Since there are k iterations in total and at each it-
eration we have (n/k) log(1/ε) elements, the total
number of function evaluations cannot be more than
k × (n/k) log(1/ε) = n log(1/ε). The proof of the
approximation guarantee is given in the analysis section.
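To give a purely illustrative sense of the savings implied by this bound: for n = 10^6, k = 1000 and ε = 0.01, each round samples s = (n/k) log(1/ε) ≈ 1000 × 4.6 ≈ 4600 elements, so STOCHASTIC-GREEDY uses roughly 4.6 × 10^6 function evaluations in total, whereas the standard greedy algorithm would need n · k = 10^9.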
Random Sampling with Lazy Evaluation
While our theoretical results show a provably linear time al-
gorithm, we can combine the random sampling procedure
with lazy evaluation to boost its performance. There are
mainly two reasons why lazy evaluation helps. First, the
randomly sampled sets can overlap and we can exploit the
previously evaluated marginal gains. Second, as in LAZY-
GREEDY although the marginal values of the elements might
change in each step of the greedy algorithm, often their or-
dering does not change (Minoux 1978a). Hence, in line 4
of Algorithm 1 we can directly apply lazy evaluation as fol-
lows. We maintain an upper bound ρ(e) (initially ∞) on
the marginal gain of all elements sorted in decreasing order.
In each iteration i, STOCHASTIC-GREEDY samples a set R.
From this set R it evaluates the element that comes on top
of the list. Let’s denote this element by e. It then updates
the upper bound for e, i.e., ρ(e) ← ∆(e|A_{i−1}). If after the
update ρ(e) ≥ ρ(e′) for all e′ ≠ e where e, e′ ∈ R, sub-
modularity guarantees that e is the element with the largest
marginal gain in the set R. Hence, lazy evaluation helps us
reduce the number of function evaluations in each round.
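The combination just described can be sketched as follows; again this is a minimal illustration rather than the implementation used in the experiments, and it assumes f takes a list of selected elements and that elements of V are hashable. The bounds ρ(e) persist across rounds, and within each sampled set R only the top-ranked element is re-evaluated.

import heapq
import math
import random

def stochastic_greedy_lazy(f, V, k, eps=0.1, rng=random):
    n = len(V)
    s = max(1, math.ceil((n / k) * math.log(1.0 / eps)))
    rho = {e: float("inf") for e in V}     # stale upper bounds on marginal gains
    A, f_A = [], f([])
    remaining = set(V)
    for _ in range(k):
        R = rng.sample(list(remaining), min(s, len(remaining)))
        heap = [(-rho[e], idx, e) for idx, e in enumerate(R)]
        heapq.heapify(heap)
        while True:
            _, idx, e = heapq.heappop(heap)
            rho[e] = f(A + [e]) - f_A      # refresh the bound of the top element only
            if not heap or rho[e] >= -heap[0][0]:
                break                      # e provably has the largest gain within R
            heapq.heappush(heap, (-rho[e], idx, e))
        A.append(e)
        f_A += rho[e]
        remaining.remove(e)
    return A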
Experimental Results
In this section, we address the following questions: 1)
how well does STOCHASTIC-GREEDY perform compared
to previous art and in particular LAZY-GREEDY, and 2)
how does STOCHASTIC-GREEDY help us get near-optimal
solutions on large datasets by reducing the computational
complexity? To this end, we compare the performance of
our STOCHASTIC-GREEDY method to the following bench-
marks: RANDOM-SELECTION, where the output is k ran-
domly selected data points from V ; LAZY-GREEDY, where
the output is the k data points produced by the accelerated
greedy method (Minoux 1978a); SAMPLE-GREEDY, where
the output is the k data points produced by applying LAZY-
GREEDY on a subset of data points parametrized by sam-
pling probability p; and THRESHOLD-GREEDY, where the
output is the k data points provided by the algorithm of
Badanidiyuru et al. (2014). In order to compare the com-
putational cost of different methods independently of the
concrete implementation and platform, in our experiments
we measure the computational cost in terms of the number
of function evaluations used. Moreover, to implement the
SAMPLE-GREEDY method, random subsamples are gener-
ated geometrically using different values for probability p.
Higher values of p result in subsamples of larger size from
the original dataset. To maximize fairness, we implemented
an accelerated version of THRESHOLD-GREEDY with lazy
evaluations (not specified in the paper) and report the best
results in terms of function evaluations. Among all bench-
marks, RANDOM-SELECTION has the lowest computational
cost (namely, one) as we need to only evaluate the selected
set at the end of the sampling process. However, it provides
the lowest utility. On the other side of the spectrum, LAZY-
GREEDY makes k passes over the full ground set, providing
typically the best solution in terms of utility. The lazy eval-
uation eliminates a large fraction of the function evaluations
in each pass. Nonetheless, it is still computationally pro-
hibitive for large values of k.
In our experimental setup, we focus on three important
and classic machine learning applications: nonparametric
learning, exemplar-based clustering, and sensor placement.
Nonparametric Learning. Our first application is data
subset selection in nonparametric learning. We focus on the
special case of Gaussian Processes (GPs) below, but simi-
lar problems arise in large-scale kernelized SVMs and other
kernel machines. Let X_V be a set of random variables in-
dexed by the ground set V. In a Gaussian Process (GP)
we assume that every subset X_S, for S = {e_1, . . . , e_s},
is distributed according to a multivariate normal distribu-
tion, i.e., P(X_S = x_S) = N(x_S; µ_S, Σ_{S,S}), where µ_S =
(µ_{e_1}, . . . , µ_{e_s}) and Σ_{S,S} = [K_{e_i,e_j}] (1 ≤ i, j ≤ s) are
the prior mean vector and prior covariance matrix, respec-
tively. The covariance matrix is given in terms of a posi-
tive definite kernel K, e.g., the squared exponential kernel
K_{e_i,e_j} = exp(−‖e_i − e_j‖_2^2 / h^2) is a common choice in prac-
tice. In GP regression, each data point e ∈ V is considered a
random variable. Upon observations y_A = x_A + n_A (where
n_A is a vector of independent Gaussian noise with variance
σ²), the predictive distribution of a new data point e ∈ V is
a normal distribution P(X_e | y_A) = N(µ_{e|A}, σ²_{e|A}), where

µ_{e|A} = µ_e + Σ_{e,A}(Σ_{A,A} + σ²I)^{−1}(x_A − µ_A),
σ²_{e|A} = σ²_e − Σ_{e,A}(Σ_{A,A} + σ²I)^{−1}Σ_{A,e}.    (2)
Note that evaluating (2) is computationally expensive as
it requires a matrix inversion. Instead, most efficient ap-
proaches for making predictions in GPs rely on choosing a
small, so-called active, set of data points. For instance, in
the Informative Vector Machine (IVM) we seek a summary
A such that the information gain, f(A) = I(Y_A; X_V) =
H(X_V) − H(X_V | Y_A) = ½ log det(I + σ^{−2} Σ_{A,A}), is max-
imized. It can be shown that this choice of f is monotone
submodular (Krause and Guestrin 2005). For small values
of |A|, running LAZY-GREEDY is possible. However, we
see that as the size of the active set or summary A increases,
the only viable option in practice is STOCHASTIC-GREEDY.
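As an illustration of this objective, a minimal numpy sketch of the information gain and of the squared exponential kernel is given below; the function and array names are illustrative and not part of the experimental code.

import numpy as np

def squared_exponential_kernel(X, h=0.75):
    # K_ij = exp(-||x_i - x_j||^2 / h^2) over the rows of X.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / h**2)

def information_gain(K, A, sigma=1.0):
    # f(A) = 0.5 * log det(I + sigma^{-2} K[A, A]); K is the full kernel matrix,
    # A is a list of selected indices.
    if len(A) == 0:
        return 0.0
    K_AA = K[np.ix_(A, A)]
    _, logdet = np.linalg.slogdet(np.eye(len(A)) + K_AA / sigma**2)
    return 0.5 * logdet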
In our experiment we chose a Gaussian kernel with h =
0.75 and σ = 1. We used the Parkinsons Telemonitor-
ing dataset (Tsanas et al. 2010) consisting of 5,875 bio-
medical voice measurements with 22 attributes from peo-
ple with early-stage Parkinson's disease. We normalized the
vectors to zero mean and unit norm. Fig. 1a and 1b com-
pare the utility and computational cost of STOCHASTIC-
GREEDY to the benchmarks for different values of k. For
THRESHOLD-GREEDY, different values of ε have been cho-
sen such that a performance close to that of LAZY-GREEDY
is obtained. Moreover, different values of p have been cho-
sen such that the cost of SAMPLE-GREEDY is almost equal
to that of STOCHASTIC-GREEDY for different values of ε.
As we can see, STOCHASTIC-GREEDY provides the closest
(practically identical) utility to that of LAZY-GREEDY with
much lower computational cost. Decreasing the value of ε
results in higher utility at the price of higher computational
cost. Fig. 1c shows the utility versus cost of STOCHASTIC-
GREEDY along with the other benchmarks for a fixed k =
200 and different values of ε. STOCHASTIC-GREEDY pro-
vides very compelling tradeoffs between utility and cost
compared to all benchmarks, including LAZY-GREEDY.
Exemplar-based clustering. A classic way to select a set
of exemplars that best represent a massive dataset is to solve
the k-medoid problem (Kaufman and Rousseeuw 2009) by
minimizing the sum of pairwise dissimilarities between ex-
emplars A and elements of the dataset V as follows: L(A) =
(1/|V|) Σ_{e∈V} min_{v∈A} d(e, v), where d : V × V → ℝ is a dis-
tance function, encoding the dissimilarity between elements.
By introducing an appropriate auxiliary element e_0 we can
turn L into a monotone submodular function (Gomes and
Krause 2010): f(A) = L({e_0}) − L(A ∪ {e_0}). Thus max-
imizing f is equivalent to minimizing L which provides a
very good solution. But the problem becomes computation-
ally challenging as the size of the summary A increases.
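A minimal numpy sketch of the resulting objective, with the origin as the auxiliary exemplar e_0 and squared Euclidean dissimilarity, is shown below; it is a naive implementation meant only to illustrate the formula.

import numpy as np

def clustering_utility(X, A):
    # f(A) = L({e0}) - L(A ∪ {e0}) with e0 = the origin, d(e, v) = ||e - v||^2,
    # X an (n, d) data matrix and A a list of row indices.
    d_to_origin = (X ** 2).sum(axis=1)
    L_e0 = d_to_origin.mean()
    if len(A) == 0:
        return 0.0
    d_to_A = ((X[:, None, :] - X[A][None, :, :]) ** 2).sum(axis=2).min(axis=1)
    L_A_e0 = np.minimum(d_to_origin, d_to_A).mean()
    return L_e0 - L_A_e0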
In our experiment we chose d(x, x′) = ‖x − x′‖² for the
dissimilarity measure. We used a set of 10,000 Tiny Images
(Torralba, Fergus, and Freeman 2008) where each 32 × 32
RGB image was represented by a 3,072 dimensional vector.
We subtracted from each vector the mean value, normalized
it to unit norm, and used the origin as the auxiliary exem-
plar. Fig. 1d and 1e compare the utility and computational
cost of STOCHASTIC-GREEDY to the benchmarks for differ-
ent values of k. It can be seen that STOCHASTIC-GREEDY
outperforms the benchmarks with significantly lower com-
putational cost. Fig. 1f compares the utility versus cost of
different methods for a fixed k = 200 and various p and ε.
Similar to the previous experiment, STOCHASTIC-GREEDY
achieves near-maximal utility at substantially lower cost
compared to the other benchmarks.
Large scale experiment. We also performed a simi-
lar experiment on a larger set of 50,000 Tiny Images. For
this dataset, we were not able to run LAZY-GREEDY and
THRESHOLD-GREEDY. Hence, we compared the utility and
cost of STOCHASTIC-GREEDY with SAMPLE-GREEDY
using different values of p. As shown in Fig. 1j and Fig. 1k,
STOCHASTIC-GREEDY outperforms SAMPLE-GREEDY in
terms of both utility and cost for different values of k. Fi-
nally, as Fig. 1l shows, STOCHASTIC-GREEDY achieves
the highest utility while running much faster compared to
SAMPLE-GREEDY, which is the only other practical solution
for this larger dataset.
Sensor Placement. When monitoring spatial phenomena,
we want to deploy a limited number of sensors in an area
in order to quickly detect contaminants. Thus, the problem
would be to select a subset of all possible locations A ⊆ V to
place sensors. Consider a set of intrusion scenarios I where
each scenario i ∈ I defines the introduction of a contam-
inant at a specified point in time. For each sensor s ∈ S
and scenario i, the detection time, T(s, i), is defined as the
time it takes for s to detect i. If s never detects i, we set
T(s, i) = ∞. For a set of sensors A, the detection time for
scenario i can be defined as T(A, i) = min_{s∈A} T(s, i).
Depending on the time of detection, we incur a penalty π_i(t)
for detecting scenario i at time t. Let π_i(∞) be the max-
imum penalty incurred if the scenario i is not detected at
all. Then, the penalty reduction for scenario i can be de-
fined as R(A, i) = π_i(∞) − π_i(T(A, i)). Having a proba-
bility distribution over possible scenarios, we can calculate
the expected penalty reduction for a sensor placement A as
R(A) = Σ_{i∈I} P(i) R(A, i). This function is monotone sub-
modular (Krause et al. 2008), and the greedy algorithm there-
fore gives us a good solution. For massive data, however,
we may need to resort to STOCHASTIC-GREEDY.
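A minimal sketch of this objective follows; the detection-time matrix, penalty function, and scenario distribution are hypothetical inputs, not the BWSN data used in the experiment.

import numpy as np

def penalty_reduction(T, penalty, p_scenario, A):
    # R(A) = sum_i P(i) * [pi_i(inf) - pi_i(T(A, i))], where T is an
    # (n_sensors, n_scenarios) matrix of detection times (np.inf if never
    # detected), penalty maps detection times to penalties elementwise,
    # p_scenario is the scenario distribution and A is a list of sensor indices.
    n_scenarios = T.shape[1]
    pi_inf = penalty(np.full(n_scenarios, np.inf))
    t_A = T[A].min(axis=0) if len(A) > 0 else np.full(n_scenarios, np.inf)
    return float(np.dot(p_scenario, pi_inf - penalty(t_A)))

def linear_penalty(t, t_max=100.0):
    # A hypothetical penalty that grows linearly with detection time, capped at t_max.
    return np.minimum(t, t_max)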
In our experiments we used the 12,527 node distribu-
tion network provided as part of the Battle of Water Sen-
sor Networks (BWSN) challenge (Ostfeld et al. 2008). Fig.
1g and 1h compare the utility and computational cost of
STOCHASTIC-GREEDY to the benchmarks for different val-
ues of k. It can be seen that STOCHASTIC-GREEDY out-
performs the benchmarks with significantly lower computa-
tional cost. Fig. 1i compares the utility versus cost of differ-
ent methods for a fixed k = 200 and various p and ε. Again
STOCHASTIC-GREEDY shows similar behavior to the previ-
ous experiments by achieving near-maximal utility at much
lower cost compared to the other benchmarks.
Conclusion
We have developed the first linear-time algorithm,
STOCHASTIC-GREEDY, with no dependence on k,
for cardinality-constrained submodular maximization.
STOCHASTIC-GREEDY provides a (1 − 1/e − ε) approx-
imation guarantee to the optimum solution with only
n log(1/ε) function evaluations. We have also demonstrated
the effectiveness of our algorithm through an extensive set
of experiments. As these show, STOCHASTIC-GREEDY
achieves a major fraction of the function utility with much
less computational cost. This improvement is useful even
in approaches that make use of parallel computing or de-
compose the submodular function into simpler functions for
faster evaluation. The properties of STOCHASTIC-GREEDY
make it very appealing and necessary for solving very
large scale problems. Given the importance of submodular
optimization to numerous AI and machine learning appli-
cations, we believe our results provide an important step
towards addressing such problems at scale.
Appendix, Analysis
The following lemma gives us the approximation guarantee.
Lemma 2. Given a current solution A, the expected
gain of STOCHASTIC-GREEDY in one step is at least
((1 − ε)/k) Σ_{a∈A*\A} ∆(a|A).
Proof. Let us estimate the probability that R ∩ (A* \ A) ≠ ∅.
The set R consists of s = (n/k) log(1/ε) random samples from
V \ A (w.l.o.g. with repetition), and hence

Pr[R ∩ (A* \ A) = ∅] = (1 − |A* \ A| / |V \ A|)^s ≤ e^{−s |A* \ A| / |V \ A|} ≤ e^{−(s/n) |A* \ A|}.

Therefore, by using the concavity of 1 − e^{−(s/n)x} as a function
of x and the fact that x = |A* \ A| ∈ [0, k], we have

Pr[R ∩ (A* \ A) ≠ ∅] ≥ 1 − e^{−(s/n)|A* \ A|} ≥ (1 − e^{−sk/n}) |A* \ A| / k.

Recall that we chose s = (n/k) log(1/ε), which gives

Pr[R ∩ (A* \ A) ≠ ∅] ≥ (1 − ε) |A* \ A| / k.    (3)

Now consider STOCHASTIC-GREEDY: it picks an element
a ∈ R maximizing the marginal value ∆(a|A). This is
clearly as much as the marginal value of an element ran-
domly chosen from R ∩ (A* \ A) (if nonempty). Overall,
R is equally likely to contain each element of A* \ A, so
a uniformly random element of R ∩ (A* \ A) is actually a
uniformly random element of A* \ A. Thus, we obtain

E[∆(a|A)] ≥ Pr[R ∩ (A* \ A) ≠ ∅] × (1 / |A* \ A|) Σ_{a∈A*\A} ∆(a|A).

Using (3), we conclude that E[∆(a|A)] ≥ ((1 − ε)/k) Σ_{a∈A*\A} ∆(a|A).
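Inequality (3) is also easy to check empirically; the following sketch estimates the overlap probability by Monte Carlo for arbitrary (hypothetical) values of n, k, ε and |A* \ A|, sampling from a ground set of size n rather than |V \ A|, which only makes the check conservative.

import math
import random

def overlap_probability(n, k, m, eps, trials=100000, rng=random):
    # Estimate Pr[R ∩ (A* \ A) ≠ ∅] when R consists of s = (n/k) log(1/eps)
    # uniform samples (with repetition) and m = |A* \ A| elements are marked.
    s = math.ceil((n / k) * math.log(1.0 / eps))
    hits = sum(
        any(rng.randrange(n) < m for _ in range(s))
        for _ in range(trials)
    )
    return hits / trials

n, k, eps, m = 10000, 100, 0.1, 40
print(overlap_probability(n, k, m, eps), ">=", (1 - eps) * m / k)   # bound (3)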
Now it is straightforward to finish the proof of Theo-
rem 1. Let A_i = {a_1, . . . , a_i} denote the solution returned
by STOCHASTIC-GREEDY after i steps. From Lemma 2,

E[∆(a_{i+1}|A_i) | A_i] ≥ ((1 − ε)/k) Σ_{a∈A*\A_i} ∆(a|A_i).

By submodularity,

Σ_{a∈A*\A_i} ∆(a|A_i) ≥ ∆(A*|A_i) ≥ f(A*) − f(A_i).

Therefore,

E[f(A_{i+1}) − f(A_i) | A_i] = E[∆(a_{i+1}|A_i) | A_i] ≥ ((1 − ε)/k) (f(A*) − f(A_i)).

By taking expectation over A_i,

E[f(A_{i+1}) − f(A_i)] ≥ ((1 − ε)/k) E[f(A*) − f(A_i)].

By induction, this implies that

E[f(A_k)] ≥ (1 − (1 − (1 − ε)/k)^k) f(A*) ≥ (1 − e^{−(1−ε)}) f(A*) ≥ (1 − 1/e − ε) f(A*).
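For completeness, the last two inequalities in the chain follow from standard estimates: (1 − (1 − ε)/k)^k ≤ e^{−(1−ε)} since 1 − x ≤ e^{−x}, and e^{−(1−ε)} = (1/e) e^{ε} ≤ (1/e)(1 + (e − 1)ε) ≤ 1/e + ε for ε ∈ [0, 1], using the convexity of e^{x} on [0, 1].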
Acknowledgment. This research was supported by SNF
200021-137971, ERC StG 307036, a Microsoft Faculty Fel-
lowship, and an ETH Fellowship.
Figure 1: Performance comparisons. a), d), g) and j) show the performance of all the algorithms for different values of k on
Parkinsons Telemonitoring, a set of 10,000 Tiny Images, Water Network, and a set of 50,000 Tiny Images, respectively. b), e), h)
and k) show the cost of all the algorithms for different values of k on the same datasets. c), f), i), l) show the utility obtained
versus cost for a fixed k = 200.
References
Badanidiyuru, A., and Vondrák, J. 2014. Fast algorithms for
maximizing submodular functions. In SODA, 1497–1514.
Badanidiyuru, A.; Mirzasoleiman, B.; Karbasi, A.; and
Krause, A. 2014. Streaming submodular optimization: Mas-
sive data summarization on the fly. In KDD.
Blelloch, G. E.; Peng, R.; and Tangwongsan, K. 2011.
Linear-work greedy parallel approximate set cover and vari-
ants. In SPAA.
Chierichetti, F.; Kumar, R.; and Tomkins, A. 2010. Max-
cover in map-reduce. In Proceedings of the 19th interna-
tional conference on World wide web.
Dasgupta, A.; Kumar, R.; and Ravi, S. 2013. Summarization
through submodularity and dispersion. In ACL.
Dueck, D., and Frey, B. J. 2007. Non-metric affinity propa-
gation for unsupervised image categorization. In ICCV.
El-Arini, K., and Guestrin, C. 2011. Beyond keyword
search: Discovering relevant scientific literature. In KDD.
El-Arini, K.; Veda, G.; Shahaf, D.; and Guestrin, C. 2009.
Turning down the noise in the blogosphere. In KDD.
Feige, U. 1998. A threshold of ln n for approximating set
cover. Journal of the ACM.
Golovin, D., and Krause, A. 2011. Adaptive submodularity:
Theory and applications in active learning and stochastic op-
timization. Journal of Artificial Intelligence Research.
Gomes, R., and Krause, A. 2010. Budgeted nonparametric
learning from data streams. In ICML.
Guillory, A., and Bilmes, J. 2011. Active semi-supervised
learning using submodular functions. In UAI. AUAI.
Kaufman, L., and Rousseeuw, P. J. 2009. Finding groups
in data: an introduction to cluster analysis, volume 344.
Wiley-Interscience.
Kempe, D.; Kleinberg, J.; and Tardos, E. 2003. Maximizing
the spread of influence through a social network. In Pro-
ceedings of the ninth ACM SIGKDD.
Krause, A., and Guestrin, C. 2005. Near-optimal nonmyopic
value of information in graphical models. In UAI.
Krause, A., and Guestrin, C. 2011. Submodularity and
its applications in optimized information gathering. ACM
Transactions on Intelligent Systems and Technology.
Krause, A.; Leskovec, J.; Guestrin, C.; VanBriesen, J.; and
Faloutsos, C. 2008. Efficient sensor placement optimiza-
tion for securing large water distribution networks. Journal
of Water Resources Planning and Management 134(6):516–
526.
Kumar, R.; Moseley, B.; Vassilvitskii, S.; and Vattani, A.
2013. Fast greedy algorithms in mapreduce and streaming.
In SPAA.
Leskovec, J.; Krause, A.; Guestrin, C.; Faloutsos, C.; Van-
Briesen, J.; and Glance, N. 2007. Cost-effective outbreak
detection in networks. In KDD, 420–429. ACM.
Lin, H., and Bilmes, J. 2011. A class of submodular func-
tions for document summarization. In North American chap-
ter of the Assoc. for Comp. Linguistics/Human Lang. Tech.
Minoux, M. 1978a. Accelerated greedy algorithms for max-
imizing submodular set functions. Optimization Techniques,
LNCS 234–243.
Minoux, M. 1978b. Accelerated greedy algorithms for max-
imizing submodular set functions. In Proc. of the 8th IFIP
Conference on Optimization Techniques. Springer.
Mirzasoleiman, B.; Karbasi, A.; Sarkar, R.; and Krause, A.
2013. Distributed submodular maximization: Identifying
representative elements in massive data. In Neural Infor-
mation Processing Systems (NIPS).
Nemhauser, G. L., and Wolsey, L. A. 1978. Best algorithms
for approximating the maximum of a submodular set func-
tion. Math. Oper. Research.
Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978.
An analysis of approximations for maximizing submodular
set functions - I. Mathematical Programming.
Ostfeld, A.; Uber, J. G.; Salomons, E.; Berry, J. W.; Hart,
W. E.; Phillips, C. A.; Watson, J.-P.; Dorini, G.; Jonker-
gouw, P.; Kapelan, Z.; et al. 2008. The battle of the water
sensor networks (bwsn): A design challenge for engineers
and algorithms. Journal of Water Resources Planning and
Management 134(6):556–568.
Rodriguez, M. G.; Leskovec, J.; and Krause, A. 2012. Infer-
ring networks of diffusion and influence. ACM Transactions
on Knowledge Discovery from Data 5(4):21:1–21:37.
Sipos, R.; Swaminathan, A.; Shivaswamy, P.; and Joachims,
T. 2012. Temporal corpus summarization using submodular
word coverage. In CIKM.
Torralba, A.; Fergus, R.; and Freeman, W. T. 2008. 80 mil-
lion tiny images: A large data set for nonparametric object
and scene recognition. IEEE Trans. Pattern Anal. Mach. In-
tell.
Tsanas, A.; Little, M. A.; McSharry, P. E.; and Ramig, L. O.
2010. Enhanced classical dysphonia measures and sparse re-
gression for telemonitoring of Parkinson's disease progres-
sion. In IEEE Int. Conf. Acoust. Speech Signal Process.
Wei, K.; Liu, Y.; Kirchhoff, K.; and Bilmes, J. 2013. Using
document summarization techniques for speech data subset
selection. In HLT-NAACL, 721–726.
Wei, K.; Iyer, R.; and Bilmes, J. 2014. Fast multi-stage
submodular maximization. In ICML.