
Machine Learning and Data Mining
Prof. Dr. Igor Trajkovski, NYUS, Spring 2008
Lecture 9: Text Classification

Text Classification Applications
• Google/Yahoo-like directory classification, recommending
• Newsgroup/blog messages: recommending, spam filtering, sentiment analysis for marketing
• News articles / scientific articles: classification, personalized newspaper
• Email messages: prioritizing, folderizing, spam filtering, advertising on Gmail

Text Categorization Methods
• Representations of text are very high dimensional (one feature for each word).
• Vectors are sparse since most words are rare.
    – Zipf's law and heavy-tailed distributions: word frequency is roughly inversely proportional to rank, so the most frequent word occurs approximately twice as often as the second most frequent word, which in turn occurs twice as often as the fourth most frequent word, and so on.
• High-bias algorithms that prevent overfitting in high-dimensional spaces work best.
    – SVMs, for example, maximize the margin to avoid over-fitting in high dimensions.
• For most text categorization tasks there are many irrelevant and also many relevant features.
• Methods that sum evidence from many or all features (e.g. naive Bayes, kNN, neural networks, SVMs) tend to work better than ones that try to isolate just a few relevant features (e.g. decision-tree or rule induction).
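To make the sparsity point concrete, here is a minimal Python sketch of a bag-of-words representation; the two example documents are made up for illustration, and in a real corpus almost every entry of each vector would be zero:

from collections import Counter

# Toy corpus: two made-up documents.
docs = [
    "the cat sat on the mat",
    "stock prices fell sharply on monday",
]

# One feature (dimension) per distinct word in the corpus.
vocabulary = sorted(set(w for d in docs for w in d.split()))

# Each document becomes a vector of word counts; most entries are zero.
for d in docs:
    counts = Counter(d.split())
    print([counts.get(w, 0) for w in vocabulary])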
Naive Bayes for Text
• Model a document in a given category as a generated bag of words, produced by repeatedly sampling with replacement from a vocabulary V = {w_1, w_2, …, w_m} based on the probabilities P(w_j | c_i).
• Smooth the probability estimates with Laplace m-estimates, assuming a uniform distribution over all words (p = 1/|V|) and m = |V|; this is equivalent to seeing each word in each category exactly once.
Naive Bayes Generative Model for Text (figure omitted)

Naive Bayes Classification (figure omitted)

Text Naive Bayes Algorithm (Training)
Let V be the vocabulary of all words in the documents in D
For each category c_i ∈ C
    Let D_i be the subset of documents in D in category c_i
    Let P(c_i) = |D_i| / |D|
    Let T_i be the concatenation of all the documents in D_i
    Let n_i be the total number of word occurrences in T_i
    For each word w_j ∈ V
        Let n_ij be the number of occurrences of w_j in T_i
        Let P(w_j | c_i) = (n_ij + 1) / (n_i + |V|)

Text Naive Bayes Algorithm (Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category:
    argmax_{c_i ∈ C}  P(c_i) ∏_{j=1..n} P(a_j | c_i)
where a_j is the word occurring in the jth position in X.

Underflow Prevention
• Multiplying many probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
• Class with highest final un-normalized log probability score is still the most probable.
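A minimal Python sketch of the training and test steps above, using the Laplace-smoothed estimates and summing log probabilities to avoid underflow; the toy documents and the spam/ham categories are made up for illustration:

import math
from collections import Counter, defaultdict

# Made-up training data: (document text, category).
train = [
    ("cheap meds buy now", "spam"),
    ("limited offer buy cheap", "spam"),
    ("meeting agenda attached", "ham"),
    ("project meeting moved to friday", "ham"),
]

vocab = set(w for text, _ in train for w in text.split())
word_counts = defaultdict(Counter)              # word_counts[c][w] = n_ij
for text, c in train:
    word_counts[c].update(text.split())

prior = {c: n / len(train) for c, n in Counter(c for _, c in train).items()}  # P(c_i)
total = {c: sum(word_counts[c].values()) for c in prior}                      # n_i

def cond_prob(w, c):
    # Laplace smoothing: P(w_j | c_i) = (n_ij + 1) / (n_i + |V|)
    return (word_counts[c][w] + 1) / (total[c] + len(vocab))

def classify(text):
    # Sum logs of probabilities instead of multiplying the probabilities.
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in text.split():
            if w in vocab:                      # ignore words never seen in training
                score += math.log(cond_prob(w, c))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("cheap meeting offer"))          # -> spam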
Naive Bayes Posterior
• Classification results of naive Bayes (the class with maximum posterior probability) are usually fairly accurate.
• However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.
– Output probabilities are generally very close to 0 or 1.
Textual Similarity Metrics
• Measuring similarity of two texts is a well-studied problem.
• Standard metrics are based on a "bag of words" model of a document that ignores word order and syntactic structure.
• May involve removing common "stop words" and stemming to reduce words to their root form.
• Vector-space model from Information Retrieval (IR) is the standard approach.
• Other metrics (e.g. edit-distance) are also used.
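A small Python sketch of this kind of preprocessing; the stop-word list is deliberately tiny and the suffix-stripping function is a crude stand-in for a real stemmer (such as Porter's), both made up for illustration:

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}   # tiny illustrative list

def crude_stem(word):
    # Very rough suffix stripping, for illustration only (not a real stemming algorithm).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The cats were sitting in the warehouses"))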
The Vector-Space Model
• Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
• These "orthogonal" terms form a vector space.
    Dimension = t = |vocabulary|
• Each term i in a document or query j is given a real-valued weight w_ij.
• Both documents and queries are expressed as t-dimensional vectors:
    d_j = (w_1j, w_2j, …, w_tj)

Graphic Representation
Example:
    D1 = 2T1 + 3T2 + 5T3
    D2 = 3T1 + 7T2 + T3
    Q  = 0T1 + 0T2 + 2T3
• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle?

Document Collection
• A collection of n documents can be represented in the vector-space model by a term-document matrix.
• An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document or it simply doesn't exist in the document.
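A tiny Python sketch of such a term-document matrix built from raw counts (the tf-idf weights discussed next would replace these counts); the three documents are made up for illustration:

from collections import Counter

docs = ["gold silver truck", "shipment of gold damaged", "delivery of silver arrived"]
terms = sorted(set(w for d in docs for w in d.split()))

# Rows are terms, columns are documents; entry (i, j) = count of term i in document j.
counts = [Counter(d.split()) for d in docs]
for t in terms:
    print(f"{t:10s}", [c[t] for c in counts])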
Term Weights: Term Frequency
• More frequent terms in a document are more important, i.e. more indicative of the topic.
    f_ij = frequency of term i in document j
• May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
    tf_ij = f_ij / max_i { f_ij }

Term Weights: Inverse Document Frequency
• Terms that appear in many different documents are less indicative of the overall topic.
    df_i = document frequency of term i = number of documents containing term i
    idf_i = inverse document frequency of term i = log(N / df_i)   (N: total number of documents)
• An indication of a term's discrimination power.
• Log used to dampen the effect relative to tf.
TF-IDF Weighting
• A typical combined term-importance indicator is tf-idf weighting:
    w_ij = tf_ij × idf_i = tf_ij × log(N / df_i)
• A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
• Many other ways of determining term weights have been proposed.
• Experimentally, tf-idf has been found to work well.
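A short Python sketch of these weights, normalizing term frequency by the most common term in each document and using idf_i = log(N / df_i); the documents are made up for illustration:

import math
from collections import Counter

docs = [
    "new home sales top forecasts",
    "home sales rise in july",
    "increase in home sales in july",
]
N = len(docs)
tokenized = [d.split() for d in docs]

# df_i: number of documents containing term i.
df = Counter()
for toks in tokenized:
    df.update(set(toks))

def tfidf(doc_tokens):
    counts = Counter(doc_tokens)
    max_f = max(counts.values())
    # w_ij = (f_ij / max_f) * log(N / df_i); terms occurring in every document get weight 0.
    return {t: (f / max_f) * math.log(N / df[t]) for t, f in counts.items()}

for toks in tokenized:
    print(tfidf(toks))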
Cosine Similarity Measure
• Cosine similarity measures the cosine of the angle between two vectors.
• Inner product normalized by the vector lengths:
    CosSim(d_j, q) = (d_j · q) / (|d_j| |q|) = Σ_i (w_ij × w_iq) / √(Σ_i w_ij² × Σ_i w_iq²)

Example (D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3, Q = 0T1 + 0T2 + 2T3):
    CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
    CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
• D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
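A quick Python check of the worked example, assuming the vectors D1 = (2, 3, 5), D2 = (3, 7, 1) and Q = (0, 0, 2) given above:

import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(round(cos_sim(D1, Q), 2))   # 0.81
print(round(cos_sim(D2, Q), 2))   # 0.13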
Prototype Classification
• Relevance feedback methods can be adapted for text categorization.
• Use standard TF/IDF weighted vectors to represent text documents (normalized by maximum term frequency).
• For each category, compute a prototype vector by summing the vectors of the training documents in the category.
• Assign test documents to the category with the closest prototype vector based on cosine similarity.
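A minimal Python sketch of this prototype scheme (the detailed pseudocode follows on the next slides): training sums the vectors of each category, and a test vector is assigned to the category whose prototype has the highest cosine similarity. The term-weight dictionaries below are made up; in practice they would be the TF/IDF vectors described above:

import math
from collections import defaultdict

def cos_sim(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train_prototypes(examples):
    # examples: list of (term-weight dict, category); sum the vectors of each category.
    prototypes = defaultdict(lambda: defaultdict(float))
    for vec, c in examples:
        for t, w in vec.items():
            prototypes[c][t] += w
    return prototypes

def classify(vec, prototypes):
    # Assign to the category whose prototype is most cosine-similar to vec.
    return max(prototypes, key=lambda c: cos_sim(vec, prototypes[c]))

train = [
    ({"ball": 1.0, "score": 0.5}, "sports"),
    ({"match": 0.8, "score": 0.7}, "sports"),
    ({"vote": 1.0, "election": 0.9}, "politics"),
]
protos = train_prototypes(train)
print(classify({"score": 1.0, "ball": 0.3}, protos))   # -> sports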
Illustration of Prototype Classification (figure omitted)

Prototype Classification Algorithm (Training)
Assume the set of categories is {c_1, c_2, …, c_n}
For i from 1 to n let p_i = <0, 0, …, 0>   (init. prototype vectors)
For each training example <x, c(x)> ∈ D
    Let d be the frequency-normalized TF/IDF term vector for doc x
    Let i be the index such that c_i = c(x)
    Let p_i = p_i + d   (add d to the prototype of its category)

Prototype Classification Algorithm (Test)
Given test document x
Let d be the TF/IDF weighted term vector for x
Let m = –2   (init. maximum cosSim)
For i from 1 to n:   (compute similarity to each prototype vector)
    Let s = cosSim(d, p_i)
    if s > m
        let m = s
        let r = c_i   (update most similar class prototype)
Return class r

Prototype Classification Properties
• Does not guarantee a consistent hypothesis.
• Forms a simple generalization of the examples in each class (a prototype).
• Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.
• Classification is based on similarity to class prototypes.

Prototype Classification Anomaly
• Prototype models have problems with disjunctive categories.

Illustration of 3 Nearest Neighbor for Text (figure omitted)

K Nearest Neighbor for Text
Training:
For each training example <x, c(x)> ∈ D
    Compute the corresponding TF-IDF vector d_x for document x

Test instance y:
Compute TF-IDF vector d for document y
For each <x, c(x)> ∈ D
    Let s_x = cosSim(d, d_x)
Sort examples x in D by decreasing value of s_x
Let N be the first k examples in D   (get most similar neighbors)
Return the majority class of examples in N

3 Nearest Neighbor Comparison
• Nearest Neighbor tends to handle disjunctive categories well.
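A minimal Python sketch of the K Nearest Neighbor procedure above, again using term-weight dictionaries and cosine similarity; the training vectors are made up for illustration and k is a free parameter:

import math
from collections import Counter

def cos_sim(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query_vec, training, k=3):
    # training: list of (term-weight dict, category).
    # Sort by decreasing cosine similarity and return the majority class of the top k.
    ranked = sorted(training, key=lambda ex: cos_sim(query_vec, ex[0]), reverse=True)
    return Counter(c for _, c in ranked[:k]).most_common(1)[0][0]

training = [
    ({"goal": 1.0, "match": 0.6}, "sports"),
    ({"score": 0.9, "team": 0.8}, "sports"),
    ({"election": 1.0, "vote": 0.7}, "politics"),
    ({"parliament": 0.8, "vote": 0.9}, "politics"),
]
print(knn_classify({"match": 1.0, "team": 0.5}, training, k=3))   # -> sports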
• An index that points from words to documents that contain them allows more rapid retrieval of similar documents.
• Once stop-words are eliminated, the remaining words are rare, so an inverted index narrows attention to a relatively small number of documents that share meaningful vocabulary with the test document.
Conclusions
• Many important applications of classification to text.
• Requires an approach that works well with large, sparse feature vectors, since typically each word is a feature and most words are rare.
    – Naive Bayes
    – kNN with cosine similarity
    – SVMs

Source: http://www.time.mk/trajkovski/teaching/aim/Lecture9.pdf
