A Hypergraph-based Method for Discovering Semantically Associated Itemsets Haishan Liu, Paea LePendu, Ruoming Jinand Dejing Dou ∗ Department of Computer and Information Science University of Oregon, Eugene, OR, 97403, USA Email: [email protected] & [email protected] † Stanford Center for Biomedical Informatics Research Stanford University, Stanford, CA, 94305, USA ‡ Computer Science Department Kent State University, Kent, OH, 44242, USA Abstract—In this paper, we address an interesting data publication and each column a boolean variable indicating mining problem of finding semantically associated itemsets, if some term is appeared. Using traditional frequent itemset i.e., items connected via indirect links. We propose a novel generation algorithms, if the number of times the item pair method for discovering semantically associated itemsets based fish oil and Raynaud's syndrome occurring together falls on a hypergraph representation of the database. We describetwo similarity measures to compute the strength of associations below some threshold, then ⟨fish oil, Raynaud's syndrome⟩ between items. Specifically, we introduce the average commute would not be picked up as a frequent itemset. However fish time similarity, sCT, based on the random walk model on
oil and Raynaud's syndrome may actually be meaningfully hypergraph, and the inner-product similarity, sL+, based on
related and the latent association can be revealed through the Moore-Penrose pseudoinverse of the hypergraph Laplacian indirect links. For instance, if ⟨blood changes, fish oil⟩ and matrix. Given semantically associated 2-itemsets generated bythese measures, we design a hypergraph expansion method ⟨blood changes, Raynaud's syndrome⟩ are both frequent, the with two search strategies, namely, the clique and connected presence of blood changes provides a connection through component search, to generate k-itemsets (k > 2). We show which fish oil and Raynaud's syndrome can be related (this the proposed method is indeed capable of capturing seman- relationship is discovered in Swanson's land mark paper tically associated itemsets through experiments performed on published in 1987 [1] before the age of Web-scale computa- three datasets ranging from low to high dimensionality. Thesemantically associated itemsets discovered in our experiment tion). We call the intermediary item such as blood changes is promising to provide valuable insights on interrelationship in this case a linking item, and the latent association between between medical concepts and other domain specific concepts.
items through the connection via one or more linking itemsthe semantically associated relationship. We present in this Keywords-Semantically associated itemset, hypergraph, ran- paper an algorithm to find semantically associated itemsets.
An object set endowed with pairwise relationships can be naturally illustrated as a graph in which vertices rep- resent objects, and any two vertices that have some kind There has been a surge of interest in tackling the problem of relationship are joined together by an edge. In the case of mining linked collection of interrelated objects. A single of frequent itemset mining, a set of objects with the co- transaction table can be characterized as a set of items linked occurrence relationship can be represented as directed or by the co-occurrence relationship. The problem of frequent undirected graphs. For illustrating this point of view, let us itemset mining can be then formalized as to identify sets consider a relational table depicted in Figure 1(a). One can of items that often occur together, or, in other words, items construct an undirected graph where the set of vertices is the that heavily connected by co-occurrence links. An itemset set of relational attributes (column items) and an edge joins is deemed frequent if its support, i.e., the percentage of two vertices if the they co-occur in a tuple (as illustrated transactions which contain that itemset, is above a threshold.
in Figure 1(b)). This graph is called Gaifman graph [2] of The support of an itemset can be viewed as a measure of a relational structure. The undirected graph can be further endorsement among items to each other in the set. However, enriched by assigning to each edge a weight equal to the it only takes into account the number of direct links between support of the 2-itemset consisting of vertices incident to the items while ignoring the number of indirect links (going edge. Cliques (complete subgraphs) in the Gaifman graph, or through intermediaries) between items. For example, given Gaifman cliques for short, are of particular interest because a relation table of annotated medical publications in a "bag- every tuple (ground atom) in data corresponds to a Gaifman of-word" representation where each row corresponds to one clique. However, ambiguity arises as not all Gaifman cliques (a) an example transaction table; (b) the Gaifman graph representation of the table; (c) The hypergraph representation of the table have matching tuple in the data. There exists cases where semantically associated k-itemsets (k > 2).
cliques are incidental in the sense that several relational The rest of this paper is organized as follows. We intro- ground atoms play together to induce a clique configuration duce the basics of hypergraph and random walk model in in the Gaifman graph, but no ground atom covers the entire Section II. We review related work in Section III. We present clique (e.g., the clique of {A, B, C, D} in Figure 1(b) does our method for discovering semantically associated itemsets not correspond to any tuple in the relational table).
based on hypergraphs in Section IV. We report experimental A natural way to remedy the ambiguity is to represent results in Section V and conclude the paper in Section VI.
the relational data as a hypergraph [3]. A hypergraph II. PRELIMINARIES AND BACKGROUND is a generalization of traditional graph. An edge in thehypergraph, called hyperedge, can connect more than two A. Hypergraph vertices. In other words, every hyperedge is an arbitrary A hypergraph [3] is a generalization of a traditional graph nonempty subset of vertices. It is obvious that a simple where edges, called hyperedges, can connect any number graph is a special kind of hypergraph with each hyperedge of vertices. In other words, hyperedges can be viewed as containing only two vertices. In this paper, we propose to non-empty subsets of the vertices. Formally, a hypergraph employ hypergraphs to model relational structure for finding G = (V, E), is a pair in which V is the vertex set and E semantically associated itemsets. Specifically, we propose is the hyperedge set where each e ∈ E is a subset of V .
to construct a hyperedge for each tuple. The relational at- A weighted hypergraph is a hypergraph that has a positive tributes constitute the universe of vertices in the hypergraph.
number w(e) associated with each hyperedge e; called the In this representation, each hyperedge has an exact one-to- weight of hyperedge e: Denote a weighted hypergraph by one correspondent tuple (see Figure 1(c), for example).
G = (V, E, w). The degree of a vertex v ∈ V , d(v), is With the hypergraph model, finding semantically associ- ated items amounts to developing a meaningful similarity d(v) = measure between the items which takes into account the ef- fect of linking items. Such similarity measure should satisfy The degree of a hyperedge e, denoted as δ(e), is the number the intuition that the more "short" connections between two of vertices in e, i.e. δ(e) = e . A hyperedge e is said to given items (through linking items), the more similar those be incident with a vertex v when v ∈ e. The hypergraph items are. To this end, we propose to employ the following incidence matrix H R V × E is defined as
quantities as the candidate similarity measure since both { 1, v ∈ e of them have the desired property. They are, namely, the h(v, e) = commute time distance based similarity measure from therandom walk model on hypergraph, and the inner product Throughout the rest of the paper, the diagonal matrix forms similarity based on the pseudoinverse of the hypergraph for δ(e), w(e), d(v) are denoted as De, W R E , and
Laplacian. If the similarity of a pair of items measured Dv ∈ Z V , respectively.
by these quantities exceeds some threshold, then this pairof items can be deemed as a semantically associated 2- B. Random Walk itemset. Given 2-itemsets, we propose a hypergraph expan- 1) Random Walk on Simple Graph: Given a graph and sion methods based on pruning of its primal graph together a starting point we select a neighbor of it at random and with two search strategies in the resulting graph to discover move to this neighbor then we select a neighbor of this point at random and move to it etc. The random sequence in which Pc denotes the co-occurrence property. The ran- of points selected this way is a random walk on the graph.
dom walk model on hypergraph described in Section II-B2 In other words, a random walker can jump from vertex to formalizes this point of view. Our method for measuring vertex and each vertex therefore represents a state of the the strength of semantic association is hence based on Markov chain. The average first-passage time m(k i) [4] is constructing a hypergraph representation and studying its the average number of steps needed by a random walker for property with both graph theoretical and spectral analysis reaching state k for the first time, when starting from state i.
The symmetrized quantity n(i, j) = m(j i) + m(i j) called Faloutsos et al. [10] defined a connection subgraph as a the average commute time [4], provides a distance measure small subgraph of a large graph (such as social networks between any pair of states. The fact that this quantity is graphs) that best captures the relationship between two indeed a distance on a graph was proved independently by nodes. Since finding all paths connecting two nodes is Klein and Randic [5] and Gobel and Jagers [6].
impractical and finding a single most "critical" path under The Laplacian matrix L of a graph is widely used for
some criteria is unfair, the connection subgraph addresses finding many properties of the graphs in spectral graph these problems by extracting a subset of all paths between theory. Given node degree matrix D and graph adjacency
two nodes that contains only the most significant ones to matrix A, the Laplacian matrix of the graph is defined
characterize their relationship under certain constraints. The as L = D A. The normalized Laplacian is given by
paths in the connection subgraphs are essentially "impor- LN = I D1/2AD1/2, where I is the identity matrix.
tant" semantic associations. Faloutsos et al. described an The average commute time n(i, j) can be computed in efficient algorithm to produce approximate, but high-quality closed form from the Moore-Penrose pseudoinverse [7] of connection subgraphs in real time on very large graphs. In L, denoted by L+ as shown in Section IV.
comparison, our proposed method in the present paper does 2) Random Walk on Hypergraph: We can associate each not explicitly keep track of paths (in hypergraphs). Instead, hypergraph with a natural random walk which has the paths are implicitly utilized in the random walk model to transition rule as described in [8]. Given the current position measure the similarity between nodes and our algorithms u ∈ V , first choose a hyperedge e over all hyperedges to calculate the similarity are exact since the graphs that incident with u with the probability proportional to w(e), we deal with are much smaller (approx. 10k nodes) than and then choose a vertex v ∈ e uniformly at random.
Faloutsos et al's (approx 15m nodes).
Obviously, it generalizes the natural random walk defined
on simple graphs. Let P denote the transition probability
B. Indirect Associations matrix of this hypergraph random walk. Then each entry of
P is
Tan et al. [11] introduced the concept of indirect asso- h(u, e) h(v, e) p(u, v) = ciation: Consider a pair of items (a, b) with low support value. If there is an itemset Y (called the mediator set) such that the presence of a and b are highly dependent on In matrix notation, P = D1HWD1HT
items in Y , then (a, b) are said to be indirectly associated Zhou et al. [8] define the following normalized hypergraph via Y . This definition of indirect association bears a similar Laplacian L based on the random walk model: motivation to the semantically associated itemset proposed L = I Θ,
in the present paper, as both problems aim at identifyingitemsets that do not have sufficiently high support but are likely to provide useful insight into the data. The importance of indirect association has also been recognized by severalother authors [12][13]. Tan et al. were the first to propose III. RELATED WORK an algorithm to derive the indirect associations by iteratively A. Semantic Association and Connection Subgraph finding mediator set for candidate itemsets. Later Wan et Sheth et al. [9] proposed a formalism for semantic associ- al. [14] proposed a more efficient algorithm, called HI-Mine, ation between entities in an RDF graph. Specifically, the se- which improved the performance by avoiding the standard mantic association is defined based on semantic connectivity frequent itemset generation process.
which indicates if there exists a sequence of interconnected The difference between the indirect and semantic associ- links between two given entities. In our study of semantically ation is that items in the indirect association are connected associated itemsets in transaction data, the link between via a mediator set while items in the semantic association entities is essentially the ‘co-occurrence' relationship. The are connected via a path. It is worth noting that in our semantic association according to Sheth et al's definition graph-based formalism for finding semantically associated between transaction items i0 and in can be established by itemsets, the paths connecting two items are not explicitly identifying a link of the form i0, Pc, i1, Pc, . . , in−1, Pc, in, required to be identified, while in both Tan et al's and Wan et al's algorithms the mediator set Y has to be identified and when the length of paths decreases. This is indeed an along the process of discovering indirect associations.
intuitively satisfying property of the effective resistance ofthe equivalent electrical network [20]. The usual shortest- C. Random Walk-based Similarity Analysis path distance (also called geodesic distance) does not have Various quantities derived from random walk on graph has this property: the shortest-path distance does not capture the been used in a number of applications. Fouss et al. [15] com- fact that strongly connected nodes are closer than weakly prehensively compared twelve scoring algorithms based on connected nodes.
graph representation of the database to perform collaborative To compute commute-time distance between vertices in a movie recommendation. Pan et al. [16] developed a similar- hypergraph, we need to first define the combinatory hyper- ity measure based on random walk steady state probability to graph Laplacian L. It follows from Zhou et al's formalism
discover correlation between multimedia objects containing of normalized hypergraph Laplacian in Equation 2 that: data of various modalities. Yen et al. [17] introduced a L = D1/2LD1/2 = D
new k-means clustering algorithm utilizing the random walk v − HWD1
average commute time distance. Zhou et al. [18] presented The average commute time n(i, j) on simple graph can be a unified framework based on neighborhood random walk computed in closed form from the Moore-Penrose pseudoin- to integrate structural and attribute similarities for graph verse of L [7], denoted by L+ with elements l+ = [L+]
It can be shown that n(i, j) on hypergraph can be calculated Palmer et al. [19] exploited a rich duality between random in the same manner. The pseudoinverse L+ is given by the
walks on graphs and electrical circuits to develop an external following equation: similarity function called REP to measure the similaritybetween categorical attributes. In their work, they identified L+ = (L eeT /n)1 + eeT /n,
a subtle flaw of commute time distance where it degenerates where e is a column vector made of 1s (i.e., e
on realistic data when the degree distribution follows a [1, 1, . . , 1]T ). The formula for the computation of n(i, j) Zipf or power-low relationship. In other words, distances takes the form of the following equation: are skewed toward the high degree nodes. Our experimentresults based on commute time similarity confirmed this n(i, j) = VG(l+ + l+ 2l+), finding (see Table IV for illustration of this phenomena anddiscussion in Section V). We also discover that the inner- where VG = tr(Dv) is the volume of the hypergraph. If we
product based similarity does not suffer from this flaw.
define ei as the ith column of I (i.e., ei = [0
i−1 i 0 , . . , 0]T ), Equation 5 can be transformed to: In this section, we present our method for discovering n(i, j) = VG(ei − ej)T L+(ei − ej),
semantically associated itemsets based on hypergraph. Ourmethod starts by generating 2-itemsets. A 2-itemset ⟨i, j⟩ is Since n(i, j) is a distance, it is straightforward to convert considered semantically associated if the hypergraph-based it to a similarity measure sCT (i, j) by normalize it to unit similarity measure s(i, j) exceeds some threshold. In the range and subtract from 1.
following subsections, we propose two similarity measures 2) Pseudoinverse-based Inner-Product Similarity sL+: Equation 6 can be mapped into a new Euclidean space that CT and sL+ based on, respectively, the average commute time distance on hypergraph and the inner-product-based preserves the commute time distance: representation of the pseudoinverse of Hypergraph Lapla- n(i, j) = VG(ei − ej)T L+(ei − ej)
cian. Given discovered semantically associated 2-itemsets, we propose a hypergraph expansion method along with two search strategies, namely, the clique and connected compo- = VG∥x′ − x′ ∥2,
nent search, in the resulting graph for finding semantically where x= Λ1/2UT e
associated k-itemsets (k > 2).
i, U is an orthonormal matrix made
of eigenvectors of L+ (ordered in decreasing order of cor-
A. Methods for Generating 2-itemsets responding eigenvalue λk) and Λ = Diag(λk). In this way,
the transformed node vectors xare exactly separated in the
In the following we describe two similarity measures that define the strength of bond between a pair of semantically new n-dimensional Euclidean space. From this definition, it
follows that L+ is the matrix containing inner products of
associated items.
the transformed vectors xas shown below:
1) Average Commute Time Similarity s mentioned, the commute-time distance n(i, j) between two x′T x= (Λ
j = xT
nodes i and j has the desirable property of decreasing whenthe number of paths connecting the two nodes increases = eT UΛUT e
j = eT
j = l+ Therefore, L+ can be considered as a similarity matrix for
vertex will find the entire connected component containing the nodes—that is the vertex. When the search returns, loop through othervertices and start a new search whenever the loop reaches sL+(i, j) = l+ . a vertex that has not already been included in a previously The inner-product-based similarity measures are well- found connected component.
studied for the vector-space model of information retrieval.
3) Ranking of Itemsets: Once the semantically associated It has been shown that when computing proximities be- 2-itemsets and k-itemsets are generated, they can be ranked tween documents, inner-product-based measures outperform by a quantity indicating the strength of association among Euclidean distances [21].
items in the set. We tentatively compute this quantity byaveraging the total pairwise similarities over the number of B. Methods for Generating k-itemset (k > 2) subedges of the itemset's corresponding clique or connected Now, we consider finding semantically associated k- component in G′(H).
itemset (k > 2) from given 2-itemsets. As is common in C. Effective Computation hypergraph theory, we can associate an induced graph G(H) In high dimensional data sets, the computations of the with every hypergraph H by expanding every hyperedge e Hypergraph Laplacian and the pseudoinverse become in- in H to a clique in G(H). Edges in the induced graph G(H) tractable. We discuss two approaches to mitigate this scala- can be called subedges to avoid unnecessary confusion. We bility problem.
can further construct a pruned graph G′(H) from G(H) by To compute Hypergraph Laplacian L in Equation 3 re-
applying the following inclusion rule on each subedge: the quires multiplication of hypergraph incidence matrices H
similarity between the incident nodes of a subedge has to be and its transpose HT . Since H grows in proportion to the
greater than a user-specified threshold θ. In formal definition, size of underlying transaction data (each node corresponds given a hypergraph H = (V, E), the pruned subgraph is to a column and each hyperedge corresponds to a row), G′(H) = {V, E′} where it eventually becomes unable to fit in memory when the E′ = {(u, v) ∈ V 2 : u ̸= v and size exceeds a certain amount. In this case the computation u, v ∈ e for some e ∈ E and can still be carried out using a block partitioned matrixproduct by performing operations only on the submatrices s(u, v) > θ}. of tractable sizes. Owing to the fact that, in most cases, V Given G′(H), finding semantically associated k-itemset is much smaller than E , H can then be partitioned into
(k > 2) can be formulated into two ways: finding cliques or s vertical stripes and the square matrix De into s diagonal
connected components in G′(H).
blocks. The multiplication in Equation 3 can be calculated 1) Cliques of G′(H): Finding cliques in G′(H) cor- by HD1HT =
γ . Note that H is sparse
responds to searching and testing in the powerset of V .
in many applications. This property can be exploited to gain Given the fact that every subset of a clique is also a clique, high performance and due to its importance much effort has this downward-closure property can make efficient clique been devoted to the study resulting a number of libraries and discovery algorithm possible in a way similar to the Apriori routines from which we can leverage.
algorithm for finding frequent itemsets — with a "bottom As the number of nodes grows, to compute pseudoinverse up" manner, the candidate generation step extends valid in closed form using Equation 4 also becomes intractable. A k − 1 length itemsets one item at a time, and groups of procedure based on Cholesky factorization to compute L+
candidates are tested against G′(H) to determine if they for large sparse matrices [22] is proved useful. It allows to form cliques. The algorithm terminates when no further compute L+ in a column-by-column manner. In particular,
successful extensions are found.
the procedure involves the following steps for computing the 2) Connected Components of G′(H): Complete subgraph ith column of L+:
(i.e., clique) is a very strong requirement that can limit 1) Compute the projection yi of base vector ei on the
the approach to restricted cases of semantically associated column space of L.
itemsets. One way to relax this requirement is to find 2) Find a solution l∗+ of the linear system Ll = y
connected components of G′(H), which can be viewed 3) Project l∗+ on the row space of L to get l+.
as a closure under semantic association. The number of Since L is symmetric, its row space is the same as column
connected components equals the multiplicity of 0 as an space. The projection in step 1 and 2 can be represented eigenvalue of the Laplacian matrix of G′(H). Although the by the matrix (I eeT /n). The equation in step 2 can be
set of connected components is not downward closed, there solved by first solving a reduced linear system: ˆ
is efficient way to find all connected components of a graph L, ˆl, and ˆ
y are obtained respectively by removing
in linear time using either breadth-first search or depth-first the last row from l, y, and last row and column from
search. In either case, a search that begins at some particular L. We observe that ˆ
L is full rank and positive definite
and hence is able to be decomposed using the Cholesky ⟨ blood change, fish oil ⟩ L = RR . Since R is lower-triangular, one
⟨ blood change, Raynaud synd ⟩ solution of ˆ
l = RRTˆ
yi can be efficiently obtained
⟨ blood change, fish oil, Raynaud synd ⟩ by two back-substitutions. After solving the reduced linear ⟨ blood change, fish oil, f ⟩ ⟨ blood change, f ⟩ system, the solution to the original equation in step 2 is ⟨ blood change, d ⟩ therefore (l+) = [ˆl+, 0]T . With the help of this technique,
⟨ blood change, b ⟩ we are able to analyze datasets of a million rows and 10 ⟨ blood change, a ⟩ ⟨ blood change, e ⟩ thousand columns.
⟨ blood change, c ⟩ ⟨ fish oil, Raynaud synd ⟩ ⟨ fish oil, f ⟩ V. EXPERIMENTAL EVALUATION ⟨ fish oil, d ⟩ ⟨ fish oil, b ⟩ Because we are interested in understanding the differences ⟨ Raynaud synd, f ⟩ between the sCT and sL+ similarity measures for generatingsemantically associated itemsets, we conducted a series of TOP SEMANTICALLY ASSOCIATED ITEMSETS GENERATED BY sCT experiments to highlight their differences. First, to illustrate FROM THE SYNTHETIC FISH OIL DATASET.
the power of hypergraphs in finding associations via linkingitems, we synthesized a dataset for the fish oil example.
Next, to illustrate the differences between the two methods,we evaluated both methods against a commonly used shop- B. Shopping Cart ping cart dataset. Finally, encouraged by these results, weapplied these methods to actual electronic health records to 1) Dataset: To better understand how the sCT method highlight their scalability and applicability to the medical compares against the sL+ method, we tested them on a busi- ness shopping cart dataset. This dataset contains purchaseinformation on 100 grocery items (represented by booleancolumn headers) for 2,127 shopping orders (corresponding A. Fish Oil to tuples). We applied sL+ and sCT and set a threshold to 1) Dataset: As mentioned in Section I, fish oil and include top-100 2-itemsets, based on which we subsequently Raynaud's syndrome have been shown by Swanson [1] to used clique search to generate (k > 2) itemsets. The top- be linked together indirectly via various blood changes.
10 2-itemset results and (k > 2)-itemsets corresponding to He found these associations from examining biomedical maximum cliques generated by sCT and s+ are reported in texts. As a proof of concept, we replicated this situation Table II and III respectively.
by synthesizing a table of 50 rows, which is about the same 2) Results: Unlike the experiment on the fish oil dataset, scale as in Swanson's experiment. Each row represents a set We do not have specific hypothesis to validate in this test.
of terms generated to represent biomedical text. Each set After examining the results from both measures, we can only of terms was specifically generated so that fish oil and Ray- conclude they make intuitive sense. However, we observe naud's syndrome never appear together. The column headers that the difference between the sCT and sL+ becomes more include fish oil, blood changes, Raynaud's syndrome. Six significant in this experiment. The sCT tends to include other random variables acted as noise. We then applied the itemsets with high support and the effect of indirect links sCT , sL+ to the dataset. Specifically, we set a threshold is less pronounced. On the other hand, sL+ promotes items for first generating top-15 2-itemsets using either similarity with support values towards the lower end. We also observe measure. Based on the generated 2-itemsets we used clique one drawback of the sCT that the result is centered around search to generate (k > 2)-itemsets.
items with large frequencies (i.e., many direct links to other 2) Results: The hypergraph approach finds significant nodes) and hence in a sense limiting the information (most links between fish oil and Raynaud's syndrome, as demon- itemsets are about cheese, soup and cookie). By contrast, the strated particularly well by the sCT method as shown in sL+ produces more diversified itemsets. This phenomenon Table I. Even the triplet was discovered by the clique search is illustrated in Table IV by comparing the rankings under technique. Most notably, because their co-occurrence is zero, sCT , sL+ and the ranking under support using Kendall-τ the association would never be discovered by traditional fre- score. The degeneration of sCT towards support is more quent itemset techniques such as the Apriori algorithm [23].
pronounced in larger datasets as will be seen in the next L+ method also picks-up the association, but it was fairly weak: the association is ranked 23rd among all 2- Finally we tested our methods on the dataset of electronic itemsets (column 3 in Table I lists the ranking of the sCT health records of real patients. This dataset is different from results given by the sL+). However, as our next evaluations the above two datasets not only in scale but also in practical suggest, the sL+ demonstrates other favorable qualities.
importance as described in the following.
⟨ Cheese, Soup ⟩ Electronic health ⟨ Cheese, Dried Fruit ⟩ ⟨ Dried, Fruit Soup ⟩ ⟨ Cookies, Soup ⟩ ⟨ Cheese, Cookies ⟩ ⟨ Cookies, Dried Fruit ⟩ THE KENDALL-τ SCORE BETWEEN RANKINGS OF ITEMSETS ⟨ Cheese, Preserves ⟩ GENERATED BY sCT , sL+ AND SUPPORT IN THE TWO EXPERIMENTS.
⟨ Cheese, Wine ⟩ ⟨ Preserves, Soup ⟩ ⟨ Soup, Wine ⟩ ⟨ Canned Vegetables, Cheese, Cookies, Dried Fruit, Frozen 1.6 million patients, 15 million encounters, 25 million coded Vegetables, Nuts, Preserves, ICD9 diagnoses, and a combination of pathology, radiology, Soup, Wine ⟩ and transcription reports totaling over 9 million clinical notes (i.e., unstructured text).
TOP SEMANTICALLY ASSOCIATED ITEMSETS GENERATED BY sCT From this set of 1.6 million patients, we extracted a cohort FROM THE SHOPPING CART DATASET.
of patients that suffered from kidney failure. Out of thoserecords, we applied our algorithms to all previous records in the patient's timeline, looking at just the set of drugs.
⟨ Sardines, Conditioner ⟩ Therefore, at a very simplistic level, the experiment result ⟨ Toothbrushes, Nasal Sprays ⟩ shows that semantically associated itemsets in this context ⟨ Yogurt, Anchovies ⟩ could possibly represent sets of drugs that could lead toward ⟨ Sports Magazines, Cottage Cheese ⟩ ⟨ Tofu, Sour Cream ⟩ kidney failure when used in combination.
⟨ Toothbrushes, Acetominifen ⟩ 2) Results: The cohort dataset described above contains ⟨ Sauces, Nasal Sprays ⟩ 467791 rows (corresponding to patients' clinical notes) and ⟨ Sports Magazines, Gum ⟩ ⟨ Sunglasses, Paper Dishes ⟩ 10167 columns (corresponding to annotated terms appeared ⟨ Tofu, Canned Fruit ⟩ in the notes). With the help of the techniques described in ⟨ Canned Fruit, Sour Cream, Tofu ⟩ Section IV-C, we are able to compute L+ in a tractable ⟨ Batteries, Cereal, Cooking Oil ⟩ ⟨ Canned Vegetables, Nuts, Waffles ⟩ amount of time (Equation 3 and 4 are calculated within 4hours on a Quad-Core AMD Opteron(tm) Processor with 8 gigabyte memory), based on which we can efficiently derive TOP SEMANTICALLY ASSOCIATED ITEMSETS GENERATED BY sL+ FROM the sL+ itemsets. However, the calculation of sCT on this THE SHOPPING CART DATASET.
scale is intractable because an exact computation of all pair-wise sCT requires to fill in a V × V similarity table. In or-der to ameliorate the computational cost, we exploit domainknowledge to identify 582 terms of particular interest and C. Electronic Health Records then apply both sCT and sL+ on the reduced dataset. The 1) Dataset: In our third evaluation, we analyzed the elec- results are shown in Table V and VI respectively, where we tronic health records of real patients. Applying methods like list top-10 2-itemsets and all (k >2)-itemsets corresponding the ones we have described to this kind of data is particularly to the maximum clique.
relevant because of recent legislation aimed at increasing It is clear that, continuing the trend shown in the shopping the meaningful use of electronic health records. Discovering cart analysis, the sCT result becomes increasingly concor- meaningful semantically associated itemsets among the set dant with the support-based method. For illustrating this of drugs and diseases identified in patients' clinical notes is a point of view, we calculate the Kendall-τ score between critical step toward identifying combinations of drug classes the ranking of itemsets generated by sCT , sL+, and support and co-morbidities, or risk-factors and co-morbidities that as shown in Table IV. We observe from the table that as are common in patients with a certain outcome (for example, the sCT converges to support, the sL+ becomes even more those suffering from myocardial infarction), toward building distinct from it. The result is that the itemsets discovered by predictive risk models, as well as toward providing probable sCT contain mostly general terms that are repeatedly found hypotheses about the possible causes of that outcome.
in the patients' notes. Although the association is reasonable We obtained the set of drugs and diseases for each but hardly interesting. On the contrary, the sL+ result is not patient's clinical note by using a new tool, the Annotator affected by the dimension of data as well as the presence of Workflow, developed at the National Center for Biomedical items with massive support. It identifies itemsets of relatively Ontology (NCBO). The patient notes are from Stanford Hos- low support but more closely bonded by indirect links.
pital's Clinical Data Warehouse (STRIDE). These records To demonstrate the scalability of the method based on the archive over 17-years worth of patient data comprising of sL+, we also conducted the same analysis on the data of the sCT Freq Itemset
39204 ⟨ Calcium Chloride, Amiloride ⟩ ⟨ White faced hornet venom, Yellow hornet venom ⟩ 29325 ⟨ Calcium Chloride, Aspirin ⟩ ⟨ Trichloroacetic Acid, Trichloroacetate ⟩ 28644 ⟨ Calcium Chloride, Probenecid ⟩ ⟨ Cloxacillin Sodium, benzathine cloxacillin ⟩ 24805 ⟨ Calcium Chloride, Furosemide ⟩ ⟨ Methacycline, Methacycline hydrochloride ⟩ 34271 ⟨ Calcium Chloride, Calcium ⟩ ⟨ Entamoebiasis, Hepatic, Liver Abscess, Amebic ⟩ 2-itemsets 0.71 21481 ⟨ Calcium Chloride, Disulfiram ⟩ ⟨ butenafine, Butenafine hydrochloride ⟩ 16814 ⟨ Calcium Chloride, Amphetamine ⟩ ⟨ Acetone, Cantharidin ⟩ 19850 ⟨ Calcium Chloride, Prednisone ⟩ ⟨ ethyl cellulose, Cantharidin ⟩ 12231 ⟨ Aspirin, Amiloride ⟩ ⟨ ethyl cellulose, Acetone ⟩ 12106 ⟨ Probenecid, Amiloride ⟩ ⟨ Poloxamer 407, Eucalyptol ⟩ ⟨ Calcium Chloride, Disul-firam, Amphetamine, Aceta- minophen, Calcium, Aspirin, TOP SEMANTICALLY ASSOCIATED ITEMSETS GENERATED BY sL+ FROM Probenecid, Amiloride, Pred- THE WHOLE ELECTRONIC HEALTH DATASET AFTER 2010. THE DATASET nisone, Furosemide ⟩ CONTAINS 1 MILLION ROWS AND 10K COLUMNS.
results are a matter of on-going evaluation with medicalexperts.
⟨ sevoflurane, remifentanil ⟩ ONCLUSION AND FUTURE WORK ⟨ frovatriptan, almotriptan ⟩ In this paper, we have proposed a method for dis- ⟨ Etomidate, Rocuronium ⟩ ⟨ Atazanavir, Pyrimethamine ⟩ covering semantically associated itemsets. It is based on ⟨ ciclesonide, Fluorometholone ⟩ a hypergraph representation of the database where each 2-itemsets 0.377 231 ⟨ naratriptan, Mefenamic Acid ⟩ column corresponds to a hypergraph node and each row ⟨ ciclesonide, Vincristine ⟩ ⟨ Rocuronium, sevoflurane ⟩ corresponds to a hyperedge. We described two similarity ⟨ tazarotene, halobetasol propionate ⟩ measures to compute the strength of association between ⟨ Buprenorphine, alosetron ⟩ items which are used to generate semantically associated 2- ⟨ Ketorolac, Flurbiprofen, Ke- torolac, Etodolac, Sulindac, itemsets. Specifically, we introduced the average commute Piroxicam, Ketoprofen ⟩ time similarity, sCT , based on the random walk model onhypergraph, and the inner-product similarity, sL+ , based on the Moore-Penrose pseudoinverse of the hypergraph TOP SEMANTICALLY ASSOCIATED ITEMSETS GENERATED BY sL+ FROM THE KIDNEY FAILURE COHORT OF THE ELECTRONIC HEALTH DATASET.
Laplacian matrix. Given generated 2-itemsets, we proposeda hypergraph expansion method with two search strategies,namely, the clique and connected component search, togenerate k-itemsets (k > 2).
whole cohort after 2010. The data consisted 1 million rows We showed the proposed method is indeed capable of and 10 thousand columns. We were able to produce the sL+ capturing semantically associated itemsets through experi- based 2-itemsets in 6 hours. The top results are shown in ments performed on three datasets ranging from low to high dimensionality. We observed that the sCT tends to generate The discovered sL+ itemsets provide much valuable in- concordant itemsets with those generated by support-based sights on the possible interrelationship between drugs. Some method as the dimensionality grows, while sL+ performs of them has been studied in the literature. For example, well in consistently picking up items connected via indirect sevoflurane/remifentanil can be used for anaesthesia; frova- links. The semantically associated itemsets discovered by triptan and almotriptan are both oral treatment of migraine sL+ on the patients' clinical note dataset provide valuable headache; Etomidate and Rocuronium can be used for rapid insights on the possible interrelationship between drugs. The sequence intubation; etc. This area of research is still very draw back of this method is that the sCT measure does new and there are no good gold standards to compare our not scale well for large datasets. We have to resort to "a results against. However, for single-item drugs that lead to priori" pruning of the data in the experiment. We are going to kidney failure, SIDER1 database lists drugs and their side- investigate iterative formulas and approximation algorithms effects. Most notably, multi-itemsets are difficult to identify, to improve the scalability.
but our methods have found not only Ketoprofen but it has Using hypergraph-based representation to model data has also group other drugs like it (see the (k > 2)-itemset shown an important benefit of enabling systematic combination of in Table VI, all of the items are anti-inflammatories). Our top-down and bottom-up insight discovery methods. Domainknowledge encoded in ontologies has a natural graphical representation. We will study methods to put the data graph and knowledge graph together in a principled way to achieve [12] I. D. Melamed, "Automatic Construction Of Clean Broad- a synergy between data mining and domain knowledge.
Coverage Translation Lexicons," in In Proceedings of the 2ndConference of the Association for Machine Translation in the VII. ACKNOWLEDGMENT Americas, pp. 125–134.
Haishan Liu and Dejing Dou's work in this paper is [13] G. Das, H. Mannila, and P. Ronkainen, "Similarity of At- partially supported by the NIH/NIBIB with Grant No.
tributes by External Probes," in In Knowledge Discovery and R01EB007684. Ruoming Jin's work in this paper is partially Data Mining.
AAAI Press, 1997, pp. 23–29.
supported by National Science Foundation under CAREERAward IIS-0953950. Any opinions, findings, and conclusions [14] Q. Wan and A. An, "Efficient mining of indirect associations using hi-mine," in In Proceedings of 16th Conference of the or recommendations expressed in this material are those of Canadian Society for Computational Studies of Intelligence, the authors and do not necessarily reflect the views of the AI 2003, 2003.
[15] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens, "Random-walk computation of similarities between nodes ofa graph, with application to collaborative recommendation," [1] D. R. Swanson, "Two medical literatures that are logically IEEE Transactions on Knowledge and Data Engineering, but not bibliographically connected," Journal of the American vol. 19, p. 2007, 2006.
Society for Information Science, no. 4, pp. 228–233, Jan.
[16] J.-Y. Pan, H. Yang, C. Faloutsos, and P. Duygulu, "Cross- [2] I. Hodkinson and M. Otto, "Finite conformal hypergraph cov- modal correlation mining using graph algorithms," Knowl- ers and Gaifman cliques in finite structures," Bull. Symbolic edge Discovery and Data Mining: Challenges and Realities Logic, vol. 9, pp. 387–405, 2002.
with Real World Data. [3] C. Berge, "Hypergraphs." Bull. Symbolic Logic, 1989.
[17] L. Yen, L. Vanvyve, D. Wouters, F. Fouss, F. Verleysen, and M. Saerens, "Clustering using a random-walk based distance [4] L. Lovasz, "Random Walks on Graphs: A Survey," in Com- measure," in Proceedings of ESANN'2005, 2005. [Online].
Budapest: Janos Bolyai Math. Soc., 1993, pp.
[18] Y. Zhou, H. Cheng, and J. X. Yu, "Graph clustering based [5] D. Klein and M. Randic, "Resistance Distance," J. Math. on structural/attribute similarities," Proc. VLDB Endow., Chemistry, vol. 12, pp. 81–95, 1993.
vol. 2, pp. 718–729, August 2009. [Online]. Available:http://portal.acm.org/citation.cfm?id=1687627.1687709 [6] F. Gobel and A. Jagers, "Random Walks on Graphs," Stochas- tic Processes and Their Applications, vol. 2, pp. 311–336, [19] C. R. Palmer and C. Faloutsos, "Electricity based external similarity of categorical attributes," in In PAKDD 2003.
Springer, 2003, pp. 486–500.
[7] S. Barnett, Ed., Matrices: Methods and Applications. Oxford Univ. Press, 1992.
[20] P. Doyle and J. Snell, "Random Walks and Electric Net- works," The Math. Assoc. of Am., 1984.
[8] D. Zhou, J. Huang, and B. Scholkopf, "Learning with hy- pergraphs: Clustering, classification, and embedding," in Ad- [21] R. Baeza-Yates and B. Ribeiro-Neto, Eds., Modern Informa- vances in Neural Information Processing Systems (NIPS) 19.
MIT Press, 2006, p. 2006.
[22] I. Herstein and D. Winter, Eds., Matrix Theory and Linear [9] A. Sheth, B. Aleman-Meza, F. S. Arpinar, A. Sheth, C. Ra- Maxwell Macmillan International Editions, 1988.
makrishnan, C. Bertram, Y. Warke, K. Anyanwu, B. Aleman-meza, I. B. Arpinar, K. Kochut, C. Halaschek, C. Ramakr- [23] R. Agrawal and R. Srikant, "Fast algorithms for mining ishnan, Y. Warke, D. Avant, F. S. Arpinar, K. Anyanwu, and association rules in large databases," in Proceedings of the K. Kochut, "Semantic association identification and knowl- 20th International Conference on Very Large Data Bases, ser.
edge discovery for national security applications," Journal of San Francisco, CA, USA: Morgan Kaufmann Database Management, vol. 16, pp. 33–53, 2005.
Publishers Inc., 1994, pp. 487–499. [Online]. Available:http://portal.acm.org/citation.cfm?id=645920.672836 [10] C. Faloutsos, K. S. McCurley, and A. Tomkins, "Fast discovery of connection subgraphs," in Proceedings of thetenth ACM SIGKDD international conference on Knowledgediscovery and data mining, ser. KDD '04.
NY, USA: ACM, 2004, pp. 118–127. [Online]. Available:http://doi.acm.org/10.1145/1014052.1014068 [11] P.-N. Tan, V. Kumar, and J. Srivastava, "Indirect Association: Mining Higher Order Dependencies in Data," in IN PRINCI-PLES OF DATA MINING AND KNOWLEDGE DISCOVERY,2000, pp. 632–637.

Source: http://www.cs.kent.edu/~jin/Papers/icdm11_semanticitemsets.pdf


HIGHLIGHTS OF PRESCRIBING INFORMATION These highlights do not include all the information needed to use TIROSINT safely and effectively. • Patients unable to swallow a capsule, including young children (generally under 6 years of age) (4) See full prescribing information for TIROSINT. • Acute myocardial infarction (4) TIROSINT (levothyroxine sodium) capsules, for oral use


Evaluación intraoperatoria de la distensibilidad de la unión gastroesofágica en fundoplicatura Raul Aponte,1 Alberto Cardozo,2 Leonardo Rejon,2 Marjori Echenique,3 María Gabriela Cardozo,3 Johanan Davila,3 Maiveline Guardia4 1Neuro gastroenterólogo Director Médico Clínica Gastro Bariátrica, Maracay, Venezuela. 2Cirujano Bariátrico, Lap-