AI Research on Spam - Dictionary of Arguments

Author	Concept	Summary/Quotes	Sources


Spam: Spam refers to unsolicited, often unwanted, and frequently repetitive messages sent via electronic communication channels such as email, messages, or comments. It commonly includes advertising, scams, or irrelevant content. See also Social media, Internet, Internet Culture.
_____________
Annotation: The above characterizations of concepts are neither definitions nor exhaustive presentations of problems related to them. Instead, they are intended to give a short introduction to the contributions below. – Lexicon of Arguments.

> AI Research	> Spam	Norvig I 865 Spam/AI Research/Norvig/Russell: Language identification and genre classification are examples of text classification, as is sentiment analysis (classifying a movie or product review as positive or negative) and spam detection (classifying an email message as spam or not-spam). Since “not-spam” is awkward, researchers have coined the term ham for not-spam. We can treat spam detection as a problem in >Supervised learning.
Norvig I 866 In the machine-learning approach we represent the message as a set of feature/value pairs and apply a classification algorithm h to the feature vector X. We can make the language-modeling and machine-learning approaches compatible by thinking of the n-grams as features. This is easiest to see with a unigram model. The features are the words in the vocabulary (…) and the values are the number of times each word appears in the message. That makes the feature vector large and sparse. If there are 100,000 words in the language model, then the feature vector has length 100,000, but for a short email message almost all the features will have count zero.
This unigram representation has been called the bag of words model. You can think of the model as putting the words of the training corpus in a bag and then selecting words one at a time. The notion of order of the words is lost; a unigram model gives the same probability to any permutation of a text. Higher-order n-gram models maintain some local notion of word order.
With bigrams and trigrams the number of features is squared or cubed, and we can add in other, non-n-gram features: the time the message was sent, whether a URL or an image is part of the message, an ID number for the sender of the message, the sender’s number of previous spam and ham messages, and so on. >Language Models/Norvig, >Data compression/Norvig.
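The bag of words representation described above can be illustrated with a minimal sketch. The six-word vocabulary, the whitespace tokenizer, and the helper names `tokenize`, `bag_of_words`, and `bigram_features` are assumptions made for illustration and do not come from Norvig/Russell; a real language model would use a far larger vocabulary and a sparse representation.

```python
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace only.
    return text.lower().split()


def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word (a unigram feature vector)."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def bigram_features(text):
    """Return sparse bigram counts, which retain some local word order."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical toy vocabulary and messages.
vocabulary = ["buy", "cheap", "meeting", "now", "pills", "tomorrow"]
message = "buy cheap pills now buy now"
shuffled = "now buy pills buy now cheap"  # same words, different order

vector = bag_of_words(message, vocabulary)
same_unigrams = bag_of_words(shuffled, vocabulary) == vector
same_bigrams = bigram_features(shuffled) == bigram_features(message)
```

In this sketch the two messages yield identical unigram vectors, because word order is discarded, while their bigram counts differ. Words that do not occur in a message get count zero, which is why the vector becomes sparse for a large vocabulary.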
Norvig I 867 Data compression: To do classification by compression, we first lump together all the spam training messages and compress them as a unit. We do the same for the ham. Then, when given a new message to classify, we append it to the spam messages and compress the result. We also append it to the ham and compress that. Whichever class compresses better (adds fewer additional bytes for the new message) is the predicted class. The idea is that a spam message will tend to share dictionary entries with other spam messages and thus will compress better when appended to a collection that already contains the spam dictionary. Experiments with compression-based classification have been carried out on some of the standard corpora for >Text classification.
_____________
Explanation of symbols: Roman numerals indicate the source, Arabic numerals indicate the page number. The corresponding books are indicated on the right-hand side. ((s)…): Comment by the sender of the contribution. Translations: Dictionary of Arguments.
The note [Concept/Author], [Author1]Vs[Author2] or [Author]Vs[term], or respectively "problem:"/"solution:", "old:"/"new:" and "thesis:", is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
AI Research
Norvig I: Peter Norvig, Stuart J. Russell, Artificial Intelligence: A Modern Approach, Upper Saddle River, NJ 2010
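The classification-by-compression procedure described in the entry above can be sketched roughly as follows. The use of Python's `zlib` as the compressor, the toy corpora, and the helper names `compressed_size`, `extra_bytes`, and `classify` are assumptions made for illustration; the source does not prescribe a particular compressor or implementation.

```python
import zlib


def compressed_size(text):
    # Number of bytes after DEFLATE compression at the highest level.
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus, message):
    """Additional compressed bytes needed when `message` is appended."""
    return compressed_size(corpus + "\n" + message) - compressed_size(corpus)


def classify(message, spam_corpus, ham_corpus):
    # The class whose corpus absorbs the message more cheaply is predicted.
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


# Hypothetical toy training data; real corpora would be far larger.
spam_corpus = "\n".join([
    "win a free prize now click here to claim your reward",
    "cheap pills online buy now limited time offer",
    "you have been selected to win a free cruise click here",
])
ham_corpus = "\n".join([
    "the project meeting is moved to tomorrow at ten",
    "please review the attached report before the meeting",
    "lunch on friday works for me see you then",
])

label = classify("click here to claim your reward", spam_corpus, ham_corpus)
```

A message that repeats phrases already present in one corpus can be encoded largely as back-references to that corpus, so it adds few bytes there. Ties are resolved in favor of ham in this sketch, which is an arbitrary design choice.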


Ed. Martin Schulz, access date 2024-04-20
