Cs.kent.edu
A Hypergraph-based Method for Discovering Semantically Associated Itemsets
Haishan Liu
∗, Paea LePendu
†, Ruoming Jin
‡ and Dejing Dou
∗
∗ Department of Computer and Information Science
University of Oregon, Eugene, OR, 97403, USA
Email: [email protected] & [email protected]
† Stanford Center for Biomedical Informatics Research
Stanford University, Stanford, CA, 94305, USA
‡ Computer Science Department
Kent State University, Kent, OH, 44242, USA
Abstract—In this paper, we address an interesting data
publication and each column a boolean variable indicating
mining problem of finding semantically associated itemsets,
if some term is appeared. Using traditional frequent itemset
i.e., items connected via indirect links. We propose a novel
generation algorithms, if the number of times the item pair
method for discovering semantically associated itemsets based
fish oil and
Raynaud's syndrome occurring together falls
on a hypergraph representation of the database. We describetwo similarity measures to compute the strength of associations
below some threshold, then
⟨fish oil,
Raynaud's syndrome⟩
between items. Specifically, we introduce the average commute
would not be picked up as a frequent itemset. However
fish
time similarity,
sCT, based on the random walk model on
oil and
Raynaud's syndrome may actually be meaningfully
hypergraph, and the inner-product similarity,
sL+, based on
related and the latent association can be revealed through
the Moore-Penrose pseudoinverse of the hypergraph Laplacian
indirect links. For instance, if
⟨blood changes,
fish oil⟩ and
matrix. Given semantically associated 2-itemsets generated bythese measures, we design a hypergraph expansion method
⟨blood changes,
Raynaud's syndrome⟩ are both frequent, the
with two search strategies, namely, the clique and connected
presence of
blood changes provides a connection through
component search, to generate
k-itemsets (
k > 2). We show
which
fish oil and
Raynaud's syndrome can be related (this
the proposed method is indeed capable of capturing seman-
relationship is discovered in Swanson's land mark paper
tically associated itemsets through experiments performed on
published in 1987 [1] before the age of Web-scale computa-
three datasets ranging from low to high dimensionality. Thesemantically associated itemsets discovered in our experiment
tion). We call the intermediary item such as
blood changes
is promising to provide valuable insights on interrelationship
in this case a
linking item, and the latent association between
between medical concepts and other domain specific concepts.
items through the connection via one or more linking itemsthe
semantically associated relationship. We present in this
Keywords-Semantically associated itemset, hypergraph, ran-
paper an algorithm to find semantically associated itemsets.
An object set endowed with pairwise relationships can
be naturally illustrated as a graph in which vertices rep-
resent objects, and any two vertices that have some kind
There has been a surge of interest in tackling the problem
of relationship are joined together by an edge. In the case
of mining linked collection of interrelated objects. A single
of frequent itemset mining, a set of objects with the co-
transaction table can be characterized as a set of items linked
occurrence relationship can be represented as directed or
by the co-occurrence relationship. The problem of frequent
undirected graphs. For illustrating this point of view, let us
itemset mining can be then formalized as to identify sets
consider a relational table depicted in Figure 1(a). One can
of items that often occur together, or, in other words, items
construct an undirected graph where the set of vertices is the
that heavily connected by co-occurrence links. An itemset
set of relational attributes (column items) and an edge joins
is deemed frequent if its
support, i.e., the percentage of
two vertices if the they co-occur in a tuple (as illustrated
transactions which contain that itemset, is above a threshold.
in Figure 1(b)). This graph is called
Gaifman graph [2] of
The support of an itemset can be viewed as a measure of
a relational structure. The undirected graph can be further
endorsement among items to each other in the set. However,
enriched by assigning to each edge a weight equal to the
it only takes into account the number of direct links between
support of the 2-itemset consisting of vertices incident to the
items while ignoring the number of indirect links (going
edge. Cliques (complete subgraphs) in the Gaifman graph, or
through intermediaries) between items. For example, given
Gaifman cliques for short, are of particular interest because
a relation table of annotated medical publications in a "bag-
every tuple (ground atom) in data corresponds to a Gaifman
of-word" representation where each row corresponds to one
clique. However, ambiguity arises as not all Gaifman cliques
(a) an example transaction table; (b) the Gaifman graph representation of the table; (c) The hypergraph representation of the table
have matching tuple in the data. There exists cases where
semantically associated
k-itemsets (
k > 2).
cliques are incidental in the sense that several relational
The rest of this paper is organized as follows. We intro-
ground atoms play together to induce a clique configuration
duce the basics of hypergraph and random walk model in
in the Gaifman graph, but no ground atom covers the entire
Section II. We review related work in Section III. We present
clique (e.g., the clique of
{A, B, C, D} in Figure 1(b) does
our method for discovering semantically associated itemsets
not correspond to any tuple in the relational table).
based on hypergraphs in Section IV. We report experimental
A natural way to remedy the ambiguity is to represent
results in Section V and conclude the paper in Section VI.
the relational data as a
hypergraph [3]. A hypergraph
II. PRELIMINARIES AND BACKGROUND
is a generalization of traditional graph. An edge in thehypergraph, called hyperedge, can connect more than two
A. Hypergraph
vertices. In other words, every hyperedge is an arbitrary
A hypergraph [3] is a generalization of a traditional graph
nonempty subset of vertices. It is obvious that a simple
where edges, called hyperedges, can connect any number
graph is a special kind of hypergraph with each hyperedge
of vertices. In other words, hyperedges can be viewed as
containing only two vertices. In this paper, we propose to
non-empty subsets of the vertices. Formally, a hypergraph
employ hypergraphs to model relational structure for finding
G = (
V, E), is a pair in which
V is the vertex set and
E
semantically associated itemsets. Specifically, we propose
is the hyperedge set where each
e ∈ E is a subset of
V .
to construct a hyperedge for each tuple. The relational at-
A weighted hypergraph is a hypergraph that has a positive
tributes constitute the universe of vertices in the hypergraph.
number
w(
e) associated with each hyperedge
e; called the
In this representation, each hyperedge has an exact one-to-
weight of hyperedge
e: Denote a weighted hypergraph by
one correspondent tuple (see Figure 1(c), for example).
G = (
V, E, w). The degree of a vertex
v ∈ V ,
d(
v), is
With the hypergraph model, finding semantically associ-
ated items amounts to developing a meaningful similarity
d(
v) =
measure between the items which takes into account the ef-
fect of linking items. Such similarity measure should satisfy
The degree of a hyperedge
e, denoted as
δ(
e), is the number
the intuition that the more "short" connections between two
of vertices in
e, i.e.
δ(
e) =
e . A hyperedge
e is said to
given items (through linking items), the more similar those
be incident with a vertex
v when
v ∈ e. The hypergraph
items are. To this end, we propose to employ the following
incidence matrix
H ∈ R
V × E is defined as
quantities as the candidate similarity measure since both
{ 1
, v ∈ e
of them have the desired property. They are, namely, the
h(
v, e) =
commute time distance based similarity measure from therandom walk model on hypergraph, and the inner product
Throughout the rest of the paper, the diagonal matrix forms
similarity based on the
pseudoinverse of the hypergraph
for
δ(
e),
w(
e),
d(
v) are denoted as
De,
W ∈ R
E , and
Laplacian. If the similarity of a pair of items measured
Dv ∈ Z
V , respectively.
by these quantities exceeds some threshold, then this pairof items can be deemed as a semantically associated 2-
B. Random Walk
itemset. Given 2-itemsets, we propose a hypergraph expan-
1) Random Walk on Simple Graph: Given a graph and
sion methods based on pruning of its primal graph together
a starting point we select a neighbor of it at random and
with two search strategies in the resulting graph to discover
move to this neighbor then we select a neighbor of this
point at random and move to it etc. The random sequence
in which
Pc denotes the co-occurrence property. The ran-
of points selected this way is a random walk on the graph.
dom walk model on hypergraph described in Section II-B2
In other words, a random walker can jump from vertex to
formalizes this point of view. Our method for measuring
vertex and each vertex therefore represents a state of the
the strength of semantic association is hence based on
Markov chain. The average first-passage time
m(
k i) [4] is
constructing a hypergraph representation and studying its
the average number of steps needed by a random walker for
property with both graph theoretical and spectral analysis
reaching state
k for the first time, when starting from state
i.
The symmetrized quantity
n(
i, j) =
m(
j i) +
m(
i j) called
Faloutsos et al. [10] defined a
connection subgraph as a
the average commute time [4], provides a distance measure
small subgraph of a large graph (such as social networks
between any pair of states. The fact that this quantity is
graphs) that best captures the relationship between two
indeed a distance on a graph was proved independently by
nodes. Since finding all paths connecting two nodes is
Klein and Randic [5] and Gobel and Jagers [6].
impractical and finding a single most "critical" path under
The Laplacian matrix
L of a graph is widely used for
some criteria is unfair, the connection subgraph addresses
finding many properties of the graphs in spectral graph
these problems by extracting a subset of all paths between
theory. Given node degree matrix
D and graph adjacency
two nodes that contains only the most significant ones to
matrix
A, the Laplacian matrix of the graph is defined
characterize their relationship under certain constraints. The
as
L =
D − A. The normalized Laplacian is given by
paths in the connection subgraphs are essentially "impor-
LN =
I − D−1
/2
AD−1
/2, where
I is the identity matrix.
tant" semantic associations. Faloutsos et al. described an
The average commute time
n(
i, j) can be computed in
efficient algorithm to produce approximate, but high-quality
closed form from the Moore-Penrose pseudoinverse [7] of
connection subgraphs in real time on very large graphs. In
L, denoted by
L+ as shown in Section IV.
comparison, our proposed method in the present paper does
2) Random Walk on Hypergraph: We can associate each
not explicitly keep track of paths (in hypergraphs). Instead,
hypergraph with a natural random walk which has the
paths are implicitly utilized in the random walk model to
transition rule as described in [8]. Given the current position
measure the similarity between nodes and our algorithms
u ∈ V , first choose a hyperedge
e over all hyperedges
to calculate the similarity are exact since the graphs that
incident with
u with the probability proportional to
w(
e),
we deal with are much smaller (approx. 10k nodes) than
and then choose a vertex
v ∈ e uniformly at random.
Faloutsos et al's (approx 15m nodes).
Obviously, it generalizes the natural random walk defined
on simple graphs. Let
P denote the transition probability
B. Indirect Associations
matrix of this hypergraph random walk. Then each entry of
P is
Tan et al. [11] introduced the concept of
indirect asso-
h(
u, e)
h(
v, e)
p(
u, v) =
ciation: Consider a pair of items (
a, b) with low support
value. If there is an itemset
Y (called the mediator set)
such that the presence of
a and
b are highly dependent on
In matrix notation,
P =
D−1
HWD−1
HT
items in
Y , then (
a, b) are said to be indirectly associated
Zhou et al. [8] define the following normalized hypergraph
via
Y . This definition of indirect association bears a similar
Laplacian
L based on the random walk model:
motivation to the semantically associated itemset proposed
L =
I − Θ,
in the present paper, as both problems aim at identifyingitemsets that do not have sufficiently high support but are
likely to provide useful insight into the data. The importance
of indirect association has also been recognized by severalother authors [12][13]. Tan et al. were the first to propose
III. RELATED WORK
an algorithm to derive the indirect associations by iteratively
A. Semantic Association and Connection Subgraph
finding mediator set for candidate itemsets. Later Wan et
Sheth et al. [9] proposed a formalism for
semantic associ-
al. [14] proposed a more efficient algorithm, called HI-Mine,
ation between entities in an RDF graph. Specifically, the se-
which improved the performance by avoiding the standard
mantic association is defined based on semantic connectivity
frequent itemset generation process.
which indicates if there exists a sequence of interconnected
The difference between the indirect and semantic associ-
links between two given entities. In our study of semantically
ation is that items in the indirect association are connected
associated itemsets in transaction data, the link between
via a mediator set while items in the semantic association
entities is essentially the ‘co-occurrence' relationship. The
are connected via a path. It is worth noting that in our
semantic association according to Sheth et al's definition
graph-based formalism for finding semantically associated
between transaction items
i0 and
in can be established by
itemsets, the paths connecting two items are not explicitly
identifying a link of the form
i0
, Pc, i1
, Pc, . . , in−1
, Pc, in,
required to be identified, while in both Tan et al's and Wan
et al's algorithms the mediator set
Y has to be identified
and when the length of paths decreases. This is indeed an
along the process of discovering indirect associations.
intuitively satisfying property of the effective resistance ofthe equivalent electrical network [20]. The usual shortest-
C. Random Walk-based Similarity Analysis
path distance (also called geodesic distance) does not have
Various quantities derived from random walk on graph has
this property: the shortest-path distance does not capture the
been used in a number of applications. Fouss et al. [15] com-
fact that strongly connected nodes are closer than weakly
prehensively compared twelve scoring algorithms based on
connected nodes.
graph representation of the database to perform collaborative
To compute commute-time distance between vertices in a
movie recommendation. Pan et al. [16] developed a similar-
hypergraph, we need to first define the combinatory hyper-
ity measure based on random walk steady state probability to
graph Laplacian
L. It follows from Zhou et al's formalism
discover correlation between multimedia objects containing
of normalized hypergraph Laplacian in Equation 2 that:
data of various modalities. Yen et al. [17] introduced a
L =
D1
/2
LD1
/2 =
D
new k-means clustering algorithm utilizing the random walk
v − HWD−1
average commute time distance. Zhou et al. [18] presented
The average commute time
n(
i, j) on simple graph can be
a unified framework based on neighborhood random walk
computed in closed form from the Moore-Penrose pseudoin-
to integrate structural and attribute similarities for graph
verse of
L [7], denoted by
L+ with elements
l+ = [
L+]
It can be shown that
n(
i, j) on hypergraph can be calculated
Palmer et al. [19] exploited a rich duality between random
in the same manner. The pseudoinverse
L+ is given by the
walks on graphs and electrical circuits to develop an external
following equation:
similarity function called REP to measure the similaritybetween categorical attributes. In their work, they identified
L+ = (
L − eeT /n)
−1 +
eeT /n,
a subtle flaw of commute time distance where it degenerates
where
e is a column vector made of 1s (i.e.,
e
on realistic data when the degree distribution follows a
[1
, 1
, . . , 1]
T ). The formula for the computation of
n(
i, j)
Zipf or power-low relationship. In other words, distances
takes the form of the following equation:
are skewed toward the high degree nodes. Our experimentresults based on commute time similarity confirmed this
n(
i, j) =
VG(
l+ +
l+
− 2
l+)
,
finding (see Table IV for illustration of this phenomena anddiscussion in Section V). We also discover that the inner-
where
VG =
tr(
Dv) is the volume of the hypergraph. If we
product based similarity does not suffer from this flaw.
define
ei as the
ith column of
I (i.e.,
ei = [0
i−1
i
0
, . . , 0]
T ), Equation 5 can be transformed to:
In this section, we present our method for discovering
n(
i, j) =
VG(
ei − ej)
T L+(
ei − ej)
,
semantically associated itemsets based on hypergraph. Ourmethod starts by generating 2-itemsets. A 2-itemset
⟨i, j⟩ is
Since
n(
i, j) is a distance, it is straightforward to convert
considered semantically associated if the hypergraph-based
it to a similarity measure
sCT (
i, j) by normalize it to unit
similarity measure
s(
i, j) exceeds some threshold. In the
range and subtract from 1.
following subsections, we propose two similarity measures
2) Pseudoinverse-based Inner-Product Similarity sL+
:
Equation 6 can be mapped into a new Euclidean space that
CT and
sL+ based on, respectively, the average commute
time distance on hypergraph and the inner-product-based
preserves the commute time distance:
representation of the pseudoinverse of Hypergraph Lapla-
n(
i, j) =
VG(
ei − ej)
T L+(
ei − ej)
cian. Given discovered semantically associated 2-itemsets,
we propose a hypergraph expansion method along with two
search strategies, namely, the clique and connected compo-
=
VG∥x′ − x′ ∥2
,
nent search, in the resulting graph for finding semantically
where
x′ =
Λ1
/2
UT e
associated
k-itemsets (
k > 2).
i,
U is an orthonormal matrix made
of eigenvectors of
L+ (ordered in decreasing order of cor-
A. Methods for Generating 2-itemsets
responding eigenvalue
λk) and
Λ =
Diag(
λk). In this way,
the transformed node vectors
x′ are exactly separated in the
In the following we describe two similarity measures that
define the strength of bond between a pair of semantically
new
n-dimensional Euclidean space. From this definition, it
follows that
L+ is the matrix containing inner products of
associated items.
the transformed vectors
x′ as shown below:
1) Average Commute Time Similarity s
mentioned, the commute-time distance
n(
i, j) between two
x′T x′ = (
Λ
j =
xT
nodes
i and
j has the desirable property of decreasing whenthe number of paths connecting the two nodes increases
=
eT UΛUT e
j =
eT
j =
l+
Therefore,
L+ can be considered as a similarity matrix for
vertex will find the entire connected component containing
the nodes—that is
the vertex. When the search returns, loop through othervertices and start a new search whenever the loop reaches
sL+(
i, j) =
l+
.
a vertex that has not already been included in a previously
The inner-product-based similarity measures are well-
found connected component.
studied for the vector-space model of information retrieval.
3) Ranking of Itemsets: Once the semantically associated
It has been shown that when computing proximities be-
2-itemsets and
k-itemsets are generated, they can be ranked
tween documents, inner-product-based measures outperform
by a quantity indicating the strength of association among
Euclidean distances [21].
items in the set. We tentatively compute this quantity byaveraging the total pairwise similarities over the number of
B. Methods for Generating k-itemset (k > 2
)
subedges of the itemset's corresponding clique or connected
Now, we consider finding semantically associated
k-
component in
G′(
H).
itemset (
k > 2) from given 2-itemsets. As is common in
C. Effective Computation
hypergraph theory, we can associate an induced graph
G(
H)
In high dimensional data sets, the computations of the
with every hypergraph
H by expanding every hyperedge
e
Hypergraph Laplacian and the pseudoinverse become in-
in
H to a clique in
G(
H). Edges in the induced graph
G(
H)
tractable. We discuss two approaches to mitigate this scala-
can be called
subedges to avoid unnecessary confusion. We
bility problem.
can further construct a pruned graph
G′(
H) from
G(
H) by
To compute Hypergraph Laplacian
L in Equation 3 re-
applying the following inclusion rule on each subedge: the
quires multiplication of hypergraph incidence matrices
H
similarity between the incident nodes of a subedge has to be
and its transpose
HT . Since
H grows in proportion to the
greater than a user-specified threshold
θ. In formal definition,
size of underlying transaction data (each node corresponds
given a hypergraph
H = (
V, E), the pruned subgraph is
to a column and each hyperedge corresponds to a row),
G′(
H) =
{V, E′} where
it eventually becomes unable to fit in memory when the
E′ =
{(
u, v)
∈ V 2 :
u ̸=
v and
size exceeds a certain amount. In this case the computation
u, v ∈ e for some
e ∈ E and
can still be carried out using a block partitioned matrixproduct by performing operations only on the submatrices
s(
u, v)
> θ}.
of tractable sizes. Owing to the fact that, in most cases,
V
Given
G′(
H), finding semantically associated
k-itemset
is much smaller than
E ,
H can then be partitioned into
(
k > 2) can be formulated into two ways: finding cliques or
s vertical stripes and the square matrix
De into
s diagonal
connected components in
G′(
H).
blocks. The multiplication in Equation 3 can be calculated
1) Cliques of G′(
H)
: Finding cliques in
G′(
H) cor-
by
HD−1
HT =
γ . Note that
H is sparse
responds to searching and testing in the powerset of
V .
in many applications. This property can be exploited to gain
Given the fact that every subset of a clique is also a clique,
high performance and due to its importance much effort has
this downward-closure property can make efficient clique
been devoted to the study resulting a number of libraries and
discovery algorithm possible in a way similar to the Apriori
routines from which we can leverage.
algorithm for finding frequent itemsets — with a "bottom
As the number of nodes grows, to compute pseudoinverse
up" manner, the candidate generation step extends valid
in closed form using Equation 4 also becomes intractable. A
k − 1 length itemsets one item at a time, and groups of
procedure based on Cholesky factorization to compute
L+
candidates are tested against
G′(
H) to determine if they
for large sparse matrices [22] is proved useful. It allows to
form cliques. The algorithm terminates when no further
compute
L+ in a column-by-column manner. In particular,
successful extensions are found.
the procedure involves the following steps for computing the
2) Connected Components of G′(
H)
: Complete subgraph
ith column of
L+:
(i.e., clique) is a very strong requirement that can limit
1) Compute the projection
yi of base vector
ei on the
the approach to restricted cases of semantically associated
column space of
L.
itemsets. One way to relax this requirement is to find
2) Find a solution
l∗+ of the linear system
Ll =
y
connected components of
G′(
H), which can be viewed
3) Project
l∗+ on the row space of
L to get
l+.
as a closure under semantic association. The number of
Since
L is symmetric, its row space is the same as column
connected components equals the multiplicity of 0 as an
space. The projection in step 1 and 2 can be represented
eigenvalue of the Laplacian matrix of
G′(
H). Although the
by the matrix (
I − eeT /n). The equation in step 2 can be
set of connected components is not downward closed, there
solved by first solving a reduced linear system:
ˆ
is efficient way to find all connected components of a graph
L,
ˆl, and
ˆ
y are obtained respectively by removing
in linear time using either breadth-first search or depth-first
the last row from
l,
y, and last row and column from
search. In either case, a search that begins at some particular
L. We observe that
ˆ
L is full rank and positive definite
and hence is able to be decomposed using the Cholesky
⟨ blood change, fish oil ⟩
L =
RR . Since
R is lower-triangular, one
⟨ blood change, Raynaud synd ⟩
solution of
ˆ
l =
RRTˆ
yi can be efficiently obtained
⟨ blood change, fish oil, Raynaud synd ⟩
by two back-substitutions. After solving the reduced linear
⟨ blood change, fish oil, f ⟩
⟨ blood change, f ⟩
system, the solution to the original equation in step 2 is
⟨ blood change, d ⟩
therefore (
l∗+) = [
ˆl∗+
, 0]
T . With the help of this technique,
⟨ blood change, b ⟩
we are able to analyze datasets of a million rows and 10
⟨ blood change, a ⟩
⟨ blood change, e ⟩
thousand columns.
⟨ blood change, c ⟩
⟨ fish oil, Raynaud synd ⟩
⟨ fish oil, f ⟩
V. EXPERIMENTAL EVALUATION
⟨ fish oil, d ⟩
⟨ fish oil, b ⟩
Because we are interested in understanding the differences
⟨ Raynaud synd, f ⟩
between the
sCT and
sL+ similarity measures for generatingsemantically associated itemsets, we conducted a series of
TOP SEMANTICALLY ASSOCIATED ITEMSETS GENERATED BY
sCT
experiments to highlight their differences. First, to illustrate
FROM THE SYNTHETIC FISH OIL DATASET.
the power of hypergraphs in finding associations via linkingitems, we synthesized a dataset for the
fish oil example.
Next, to illustrate the differences between the two methods,we evaluated both methods against a commonly used
shop-
B. Shopping Cart
ping cart dataset. Finally, encouraged by these results, weapplied these methods to actual
electronic health records to
1) Dataset: To better understand how the
sCT method
highlight their scalability and applicability to the medical
compares against the
sL+ method, we tested them on a busi-
ness shopping cart dataset. This dataset contains purchaseinformation on 100 grocery items (represented by booleancolumn headers) for 2,127 shopping orders (corresponding
A. Fish Oil
to tuples). We applied
sL+ and
sCT and set a threshold to
1) Dataset: As mentioned in Section I,
fish oil and
include top-100 2-itemsets, based on which we subsequently
Raynaud's syndrome have been shown by Swanson [1] to
used clique search to generate (
k > 2) itemsets. The top-
be linked together indirectly via various
blood changes.
10 2-itemset results and (
k > 2)-itemsets corresponding to
He found these associations from examining biomedical
maximum cliques generated by
sCT and
s+ are reported in
texts. As a proof of concept, we replicated this situation
Table II and III respectively.
by synthesizing a table of 50 rows, which is about the same
2) Results: Unlike the experiment on the fish oil dataset,
scale as in Swanson's experiment. Each row represents a set
We do not have specific hypothesis to validate in this test.
of terms generated to represent biomedical text. Each set
After examining the results from both measures, we can only
of terms was specifically generated so that
fish oil and
Ray-
conclude they make intuitive sense. However, we observe
naud's syndrome never appear together. The column headers
that the difference between the
sCT and
sL+ becomes more
include
fish oil, blood changes,
Raynaud's syndrome. Six
significant in this experiment. The
sCT tends to include
other random variables acted as noise. We then applied the
itemsets with high support and the effect of indirect links
sCT ,
sL+ to the dataset. Specifically, we set a threshold
is less pronounced. On the other hand,
sL+ promotes items
for first generating top-15 2-itemsets using either similarity
with support values towards the lower end. We also observe
measure. Based on the generated 2-itemsets we used clique
one drawback of the
sCT that the result is centered around
search to generate (
k > 2)-itemsets.
items with large frequencies (i.e., many direct links to other
2) Results: The hypergraph approach finds significant
nodes) and hence in a sense limiting the information (most
links between
fish oil and
Raynaud's syndrome, as demon-
itemsets are about
cheese,
soup and
cookie). By contrast, the
strated particularly well by the
sCT method as shown in
sL+ produces more diversified itemsets. This phenomenon
Table I. Even the triplet was discovered by the clique search
is illustrated in Table IV by comparing the rankings under
technique. Most notably, because their co-occurrence is zero,
sCT ,
sL+ and the ranking under support using Kendall-
τ
the association would never be discovered by traditional fre-
score. The degeneration of
sCT towards support is more
quent itemset techniques such as the Apriori algorithm [23].
pronounced in larger datasets as will be seen in the next
L+ method also picks-up the association, but it was
fairly weak: the association is ranked 23rd among all 2-
Finally we tested our methods on the dataset of electronic
itemsets (column 3 in Table I lists the ranking of the
sCT
health records of real patients. This dataset is different from
results given by the
sL+). However, as our next evaluations
the above two datasets not only in scale but also in practical
suggest, the
sL+ demonstrates other favorable qualities.
importance as described in the following.
⟨ Cheese, Soup ⟩
Electronic health
⟨ Cheese, Dried Fruit ⟩
⟨ Dried, Fruit Soup ⟩
⟨ Cookies, Soup ⟩
⟨ Cheese, Cookies ⟩
⟨ Cookies, Dried Fruit ⟩
THE KENDALL-
τ SCORE BETWEEN RANKINGS OF ITEMSETS
⟨ Cheese, Preserves ⟩
GENERATED BY
sCT ,
sL+ AND SUPPORT IN THE TWO EXPERIMENTS.
⟨ Cheese, Wine ⟩
⟨ Preserves, Soup ⟩
⟨ Soup, Wine ⟩
⟨ Canned Vegetables, Cheese,
Cookies, Dried Fruit, Frozen
1.6 million patients, 15 million encounters, 25 million coded
Vegetables, Nuts, Preserves,
ICD9 diagnoses, and a combination of pathology, radiology,
Soup, Wine ⟩
and transcription reports totaling over 9 million clinical notes
(i.e., unstructured text).
TOP SEMANTICALLY ASSOCIATED ITEMSETS GENERATED BY
sCT
From this set of 1.6 million patients, we extracted a cohort
FROM THE SHOPPING CART DATASET.
of patients that suffered from kidney failure. Out of thoserecords, we applied our algorithms to all previous records
in the patient's timeline, looking at just the set of drugs.
⟨ Sardines, Conditioner ⟩
Therefore, at a very simplistic level, the experiment result
⟨ Toothbrushes, Nasal Sprays ⟩
shows that semantically associated itemsets in this context
⟨ Yogurt, Anchovies ⟩
could possibly represent sets of drugs that could lead toward
⟨ Sports Magazines, Cottage Cheese ⟩
⟨ Tofu, Sour Cream ⟩
kidney failure when used in combination.
⟨ Toothbrushes, Acetominifen ⟩
2) Results: The cohort dataset described above contains
⟨ Sauces, Nasal Sprays ⟩
467791 rows (corresponding to patients' clinical notes) and
⟨ Sports Magazines, Gum ⟩
⟨ Sunglasses, Paper Dishes ⟩
10167 columns (corresponding to annotated terms appeared
⟨ Tofu, Canned Fruit ⟩
in the notes). With the help of the techniques described in
⟨ Canned Fruit, Sour Cream, Tofu ⟩
Section IV-C, we are able to compute
L+ in a tractable
⟨ Batteries, Cereal, Cooking Oil ⟩
⟨ Canned Vegetables, Nuts, Waffles ⟩
amount of time (Equation 3 and 4 are calculated within 4hours on a Quad-Core AMD Opteron(tm) Processor with 8
gigabyte memory), based on which we can efficiently derive
TOP SEMANTICALLY ASSOCIATED ITEMSETS GENERATED BY
sL+ FROM
the
sL+ itemsets. However, the calculation of
sCT on this
THE SHOPPING CART DATASET.
scale is intractable because an exact computation of all pair-wise
sCT requires to fill in a
V × V similarity table. In or-der to ameliorate the computational cost, we exploit domainknowledge to identify 582 terms of particular interest and
C. Electronic Health Records
then apply both
sCT and
sL+ on the reduced dataset. The
1) Dataset: In our third evaluation, we analyzed the elec-
results are shown in Table V and VI respectively, where we
tronic health records of real patients. Applying methods like
list top-10 2-itemsets and all (
k >2)-itemsets corresponding
the ones we have described to this kind of data is particularly
to the maximum clique.
relevant because of recent legislation aimed at increasing
It is clear that, continuing the trend shown in the shopping
the meaningful use of electronic health records. Discovering
cart analysis, the
sCT result becomes increasingly concor-
meaningful semantically associated itemsets among the set
dant with the support-based method. For illustrating this
of drugs and diseases identified in patients' clinical notes is a
point of view, we calculate the Kendall-
τ score between
critical step toward identifying combinations of drug classes
the ranking of itemsets generated by
sCT ,
sL+, and support
and co-morbidities, or risk-factors and co-morbidities that
as shown in Table IV. We observe from the table that as
are common in patients with a certain outcome (for example,
the
sCT converges to support, the
sL+ becomes even more
those suffering from myocardial infarction), toward building
distinct from it. The result is that the itemsets discovered by
predictive risk models, as well as toward providing probable
sCT contain mostly general terms that are repeatedly found
hypotheses about the possible causes of that outcome.
in the patients' notes. Although the association is reasonable
We obtained the set of drugs and diseases for each
but hardly interesting. On the contrary, the
sL+ result is not
patient's clinical note by using a new tool, the
Annotator
affected by the dimension of data as well as the presence of
Workflow, developed at the National Center for Biomedical
items with massive support. It identifies itemsets of relatively
Ontology (NCBO). The patient notes are from Stanford Hos-
low support but more closely bonded by indirect links.
pital's Clinical Data Warehouse (STRIDE). These records
To demonstrate the scalability of the method based on the
archive over 17-years worth of patient data comprising of
sL+, we also conducted the same analysis on the data of the
sCT Freq Itemset
39204
⟨ Calcium Chloride, Amiloride ⟩
⟨ White faced hornet venom, Yellow hornet venom ⟩
29325
⟨ Calcium Chloride, Aspirin ⟩
⟨ Trichloroacetic Acid, Trichloroacetate ⟩
28644
⟨ Calcium Chloride, Probenecid ⟩
⟨ Cloxacillin Sodium, benzathine cloxacillin ⟩
24805
⟨ Calcium Chloride, Furosemide ⟩
⟨ Methacycline, Methacycline hydrochloride ⟩
34271
⟨ Calcium Chloride, Calcium ⟩
⟨ Entamoebiasis, Hepatic, Liver Abscess, Amebic ⟩
2-itemsets 0.71 21481
⟨ Calcium Chloride, Disulfiram ⟩
⟨ butenafine, Butenafine hydrochloride ⟩
16814
⟨ Calcium Chloride, Amphetamine ⟩
⟨ Acetone, Cantharidin ⟩
19850
⟨ Calcium Chloride, Prednisone ⟩
⟨ ethyl cellulose, Cantharidin ⟩
12231
⟨ Aspirin, Amiloride ⟩
⟨ ethyl cellulose, Acetone ⟩
12106
⟨ Probenecid, Amiloride ⟩
⟨ Poloxamer 407, Eucalyptol ⟩
⟨ Calcium Chloride, Disul-firam, Amphetamine, Aceta-
minophen, Calcium, Aspirin,
TOP SEMANTICALLY ASSOCIATED ITEMSETS GENERATED BY
sL+ FROM
Probenecid, Amiloride, Pred-
THE WHOLE ELECTRONIC HEALTH DATASET AFTER 2010. THE DATASET
nisone, Furosemide ⟩
CONTAINS 1 MILLION ROWS AND 10K COLUMNS.
TOP SEMANTICALLY ASSOCIATED ITEMSETS GENERATED BY
sCT FROM
THE KIDNEY FAILURE COHORT OF THE ELECTRONIC HEALTH DATASET.
results are a matter of on-going evaluation with medicalexperts.
⟨ sevoflurane, remifentanil ⟩
ONCLUSION AND FUTURE WORK
⟨ frovatriptan, almotriptan ⟩
In this paper, we have proposed a method for dis-
⟨ Etomidate, Rocuronium ⟩
⟨ Atazanavir, Pyrimethamine ⟩
covering semantically associated itemsets. It is based on
⟨ ciclesonide, Fluorometholone ⟩
a hypergraph representation of the database where each
2-itemsets 0.377 231
⟨ naratriptan, Mefenamic Acid ⟩
column corresponds to a hypergraph node and each row
⟨ ciclesonide, Vincristine ⟩
⟨ Rocuronium, sevoflurane ⟩
corresponds to a hyperedge. We described two similarity
⟨ tazarotene, halobetasol propionate ⟩
measures to compute the strength of association between
⟨ Buprenorphine, alosetron ⟩
items which are used to generate semantically associated 2-
⟨ Ketorolac, Flurbiprofen, Ke-
torolac, Etodolac, Sulindac,
itemsets. Specifically, we introduced the average commute
Piroxicam, Ketoprofen ⟩
time similarity,
sCT , based on the random walk model onhypergraph, and the inner-product similarity,
sL+ , based
on the Moore-Penrose pseudoinverse of the hypergraph
TOP SEMANTICALLY ASSOCIATED ITEMSETS GENERATED BY
sL+ FROM
THE KIDNEY FAILURE COHORT OF THE ELECTRONIC HEALTH DATASET.
Laplacian matrix. Given generated 2-itemsets, we proposeda hypergraph expansion method with two search strategies,namely, the clique and connected component search, togenerate
k-itemsets (
k > 2).
whole cohort after 2010. The data consisted 1 million rows
We showed the proposed method is indeed capable of
and 10 thousand columns. We were able to produce the
sL+
capturing semantically associated itemsets through experi-
based 2-itemsets in 6 hours. The top results are shown in
ments performed on three datasets ranging from low to high
dimensionality. We observed that the
sCT tends to generate
The discovered
sL+ itemsets provide much valuable in-
concordant itemsets with those generated by support-based
sights on the possible interrelationship between drugs. Some
method as the dimensionality grows, while
sL+ performs
of them has been studied in the literature. For example,
well in consistently picking up items connected via indirect
sevoflurane/remifentanil can be used for anaesthesia;
frova-
links. The semantically associated itemsets discovered by
triptan and
almotriptan are both oral treatment of migraine
sL+ on the patients' clinical note dataset provide valuable
headache;
Etomidate and
Rocuronium can be used for rapid
insights on the possible interrelationship between drugs. The
sequence intubation; etc. This area of research is still very
draw back of this method is that the
sCT measure does
new and there are no good gold standards to compare our
not scale well for large datasets. We have to resort to "a
results against. However, for single-item drugs that lead to
priori" pruning of the data in the experiment. We are going to
kidney failure, SIDER1 database lists drugs and their side-
investigate iterative formulas and approximation algorithms
effects. Most notably, multi-itemsets are difficult to identify,
to improve the scalability.
but our methods have found not only
Ketoprofen but it has
Using hypergraph-based representation to model data has
also group other drugs like it (see the (
k > 2)-itemset shown
an important benefit of enabling systematic combination of
in Table VI, all of the items are anti-inflammatories). Our
top-down and bottom-up insight discovery methods. Domainknowledge encoded in ontologies has a natural graphical
representation. We will study methods to put the data graph
and knowledge graph together in a principled way to achieve
[12] I. D. Melamed, "Automatic Construction Of Clean Broad-
a synergy between data mining and domain knowledge.
Coverage Translation Lexicons," in
In Proceedings of the 2ndConference of the Association for Machine Translation in the
VII. ACKNOWLEDGMENT
Americas, pp. 125–134.
Haishan Liu and Dejing Dou's work in this paper is
[13] G. Das, H. Mannila, and P. Ronkainen, "Similarity of At-
partially supported by the NIH/NIBIB with Grant No.
tributes by External Probes," in
In Knowledge Discovery and
R01EB007684. Ruoming Jin's work in this paper is partially
Data Mining.
AAAI Press, 1997, pp. 23–29.
supported by National Science Foundation under CAREERAward IIS-0953950. Any opinions, findings, and conclusions
[14] Q. Wan and A. An, "Efficient mining of indirect associations
using hi-mine," in
In Proceedings of 16th Conference of the
or recommendations expressed in this material are those of
Canadian Society for Computational Studies of Intelligence,
the authors and do not necessarily reflect the views of the
AI 2003, 2003.
[15] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens,
"Random-walk computation of similarities between nodes ofa graph, with application to collaborative recommendation,"
[1] D. R. Swanson, "Two medical literatures that are logically
IEEE Transactions on Knowledge and Data Engineering,
but not bibliographically connected,"
Journal of the American
vol. 19, p. 2007, 2006.
Society for Information Science, no. 4, pp. 228–233, Jan.
[16] J.-Y. Pan, H. Yang, C. Faloutsos, and P. Duygulu, "Cross-
[2] I. Hodkinson and M. Otto, "Finite conformal hypergraph cov-
modal correlation mining using graph algorithms,"
Knowl-
ers and Gaifman cliques in finite structures,"
Bull. Symbolic
edge Discovery and Data Mining: Challenges and Realities
Logic, vol. 9, pp. 387–405, 2002.
with Real World Data.
[3] C. Berge, "Hypergraphs."
Bull. Symbolic Logic, 1989.
[17] L. Yen, L. Vanvyve, D. Wouters, F. Fouss, F. Verleysen, and
M. Saerens, "Clustering using a random-walk based distance
[4] L. Lovasz, "Random Walks on Graphs: A Survey," in
Com-
measure," in
Proceedings of ESANN'2005, 2005. [Online].
Budapest: Janos Bolyai Math. Soc., 1993, pp.
[18] Y. Zhou, H. Cheng, and J. X. Yu, "Graph clustering based
[5] D. Klein and M. Randic, "Resistance Distance,"
J. Math.
on structural/attribute similarities,"
Proc. VLDB Endow.,
Chemistry, vol. 12, pp. 81–95, 1993.
vol. 2, pp. 718–729, August 2009. [Online]. Available:http://portal.acm.org/citation.cfm?id=1687627.1687709
[6] F. Gobel and A. Jagers, "Random Walks on Graphs,"
Stochas-
tic Processes and Their Applications, vol. 2, pp. 311–336,
[19] C. R. Palmer and C. Faloutsos, "Electricity based external
similarity of categorical attributes," in
In PAKDD 2003.
Springer, 2003, pp. 486–500.
[7] S. Barnett, Ed.,
Matrices: Methods and Applications. Oxford
Univ. Press, 1992.
[20] P. Doyle and J. Snell, "Random Walks and Electric Net-
works,"
The Math. Assoc. of Am., 1984.
[8] D. Zhou, J. Huang, and B. Scholkopf, "Learning with hy-
pergraphs: Clustering, classification, and embedding," in
Ad-
[21] R. Baeza-Yates and B. Ribeiro-Neto, Eds.,
Modern Informa-
vances in Neural Information Processing Systems (NIPS) 19.
MIT Press, 2006, p. 2006.
[22] I. Herstein and D. Winter, Eds.,
Matrix Theory and Linear
[9] A. Sheth, B. Aleman-Meza, F. S. Arpinar, A. Sheth, C. Ra-
Maxwell Macmillan International Editions, 1988.
makrishnan, C. Bertram, Y. Warke, K. Anyanwu, B. Aleman-meza, I. B. Arpinar, K. Kochut, C. Halaschek, C. Ramakr-
[23] R. Agrawal and R. Srikant, "Fast algorithms for mining
ishnan, Y. Warke, D. Avant, F. S. Arpinar, K. Anyanwu, and
association rules in large databases," in
Proceedings of the
K. Kochut, "Semantic association identification and knowl-
20th International Conference on Very Large Data Bases, ser.
edge discovery for national security applications,"
Journal of
San Francisco, CA, USA: Morgan Kaufmann
Database Management, vol. 16, pp. 33–53, 2005.
Publishers Inc., 1994, pp. 487–499. [Online]. Available:http://portal.acm.org/citation.cfm?id=645920.672836
[10] C. Faloutsos, K. S. McCurley, and A. Tomkins, "Fast
discovery of connection subgraphs," in
Proceedings of thetenth ACM SIGKDD international conference on Knowledgediscovery and data mining, ser. KDD '04.
NY, USA: ACM, 2004, pp. 118–127. [Online]. Available:http://doi.acm.org/10.1145/1014052.1014068
[11] P.-N. Tan, V. Kumar, and J. Srivastava, "Indirect Association:
Mining Higher Order Dependencies in Data," in
IN PRINCI-PLES OF DATA MINING AND KNOWLEDGE DISCOVERY,2000, pp. 632–637.
Source: http://www.cs.kent.edu/~jin/Papers/icdm11_semanticitemsets.pdf
HIGHLIGHTS OF PRESCRIBING INFORMATION These highlights do not include all the information needed to use TIROSINT safely and effectively. • Patients unable to swallow a capsule, including young children (generally under 6 years of age) (4) See full prescribing information for TIROSINT. • Acute myocardial infarction (4) TIROSINT (levothyroxine sodium) capsules, for oral use
Evaluación intraoperatoria de la distensibilidad de la unión gastroesofágica en fundoplicatura Raul Aponte,1 Alberto Cardozo,2 Leonardo Rejon,2 Marjori Echenique,3 María Gabriela Cardozo,3 Johanan Davila,3 Maiveline Guardia4 1Neuro gastroenterólogo Director Médico Clínica Gastro Bariátrica, Maracay, Venezuela. 2Cirujano Bariátrico, Lap-