The Library of Dresan: Dr. Anthony G. Francis, Jr.'s Weblog

Introduction to Artificial Intelligence for Public Health

Rollins School of Public Health at Emory University
Instructor: Dr. Anthony G. Francis, Jr.

Lecture 12: Data Mining and Machine Learning

Modern computer systems enable the collection of vast amounts of data, far more than there are human experts to analyze and interpret. Data mining is the semiautomated extraction of patterns of knowledge from large amounts of raw data using techniques from machine learning, pattern recognition, statistics, linguistics, databases and scientific modeling. Data mining is used in science, business and computing to describe, classify, cluster, estimate, predict and associate data, but it is not a cure-all: there are no purely automated tools that can simply "mine" messy data sets and intuitively produce answers to our deepest questions. Effective data mining is usually an intensive, iterative process of understanding, preparing, analyzing and modeling data. Artificial intelligence techniques used in the modeling phase of data mining include neural networks, decision trees, self-organizing maps, and nearest neighbor algorithms.

Outline

  • What is Data Mining?
  • What Data Mining Can Do
  • What Data Mining Can't Do
  • Steps in Successful Data Mining
  • Machine Learning
  • Case Study: Learning Decision Trees
  • Other Machine Learning Algorithms
  • Success Stories in Data Mining

Readings:

  • Artificial Intelligence: Chapters 19 and 20
  • Machines Who Think: Afterword and Timeline

What is Data Mining?

What is the problem?

  • Computers enable the collection of massive amounts of data
    • Computer Security - Thousands of records per machine per day
    • Medical Records - Hundreds of thousands of cases of even rare diseases
    • Law Enforcement - Hundreds of thousands of crimes per jurisdiction per year
    • Bioinformatics - Gigabytes of gene sequence data
    • Space Research - Terabytes of satellite imagery
    • Business Data - Terabyte-sized data warehouses
  • Not enough trained human experts exist to analyze or interpret the data
  • Some problems are so large that human expertise can't be applied
    • Hundreds of columns (features)
    • Hundreds of thousands of rows (records)
    • Gigabytes of data (possibly distributed)
  • One potential solution is data mining

Data Mining: Extracting Knowledge from Data

Data mining is the process of discovering meaningful correlations, patterns and trends from large repositories of raw data. Data mining exploits domain knowledge, databases, statistics, artificial intelligence, and the scientific method.
  • Domain Knowledge: Define questions (business, science, medical)
  • Databases: Collect, maintain and prepare vast amounts of data
  • Statistics: Analyze data to find candidate subsets/techniques
  • Artificial Intelligence: Extract knowledge from the data
  • Scientific Method: Analysis of results, feedback to earlier stages
  • Showmanship: Publish, share or exploit learned knowledge

Real-World Examples of Data Mining

Each of these examples is from a deployed system.
  • Learning that electrolyte content in sweat may predict cystic fibrosis prognosis
  • Identifying a series of crimes as being related to a known set of offenders
  • Determining that certain auto design features lead to more electrical problems

Strengths and Weaknesses of Data Mining

Capabilities of Data Mining

  • Describe: Illustrate patterns and trends within data
  • Cluster: Identify groups of similar data
  • Classify: Group data into predefined classes
  • Estimate: Label data with numerical attributes
  • Predict: Classify/estimate for data points in the future
  • Associate: Extract rules from the data set

Limitations of Data Mining

Data mining tools:
  • Cannot automatically process data repositories to answer questions
  • Cannot operate without human oversight
  • Do not pay for themselves overnight
  • Are not easy to use
  • Will not find the causes behind problems
  • Do not automatically clean up messy data

Data Mining is easy to do badly!

  • Not a silver bullet
  • Not completely automated
  • Easy to get wrong
  • No guarantee that answers exist for you to mine!
  • But can occasionally provide insights

Steps in Successful Data Mining

Successful data mining usually involves a process model that applies structure to the exploration of the data and the extraction of knowledge from it.

Data Mining and Knowledge Discovery Process Models

  • CRISP-DM: Cross Industry Standard Process for Data Mining
  • Fayyad 9-stage model: more detail, same basic outline

The Six-Stage CRISP Model

  • Understanding the Problem Domain
    • List objectives
    • Define problem
    • Outline strategy
  • Understanding the Data
    • Exploratory data analysis
    • Evaluate data quality
    • Identify relevant subsets
  • Preparing the Data
    • Extract relevant subset from raw data
    • Select cases and variables for analysis
    • Clean and transform attributes
  • Modeling the Data
    • Select and apply models
    • Calibrate and optimize models
    • Return to data preparation if needed
  • Evaluating the Models
    • Evaluate models with respect to objectives
    • Determine whether additional objectives need to be met
    • Decide whether to continue modeling or to deploy results
  • Deploying the Knowledge
    • Generate reports on knowledge collected
    • Apply knowledge to affect outcomes
    • Continue or terminate data mining

Data Modeling with Artificial Intelligence

Typically, data modeling techniques are drawn from artificial intelligence, though data mining draws liberally upon statistics, pattern recognition, and information visualization. Typical AI techniques for data mining include:
  • Machine Learning: Extract knowledge from the features
  • Pattern Recognition: Extract knowledge from the features
  • Text Summarization: Extract additional features from text
  • Language Understanding: Extract knowledge directly from text
  • Vision: Extract additional features from images
While text summarization, language understanding and vision are now starting to be used, the primary technique used in data mining is machine learning.

Machine Learning

There are many different kinds of machine learning, from rote memorization to scientific discovery.
  • Memorization: Learning by rote
  • Induction: Generalizing over examples
  • Deduction: Extracting knowledge from knowledge
  • Discovery: Self-guided exploration
The primary kind used in data mining is inductive learning, or learning from examples.

Kinds of Inductive Learning

Inductive learning problems can be categorized by the feedback available to the learning mechanism.
  • Supervised Learning: Learning a function from input to output
  • Unsupervised Learning: Learning unspecified patterns in data
  • Reinforcement Learning: Learning from rewards (or lack thereof)

A Model of Learning from Examples

Supervised learning can be viewed as the process of learning a function f from a set of inputs to a set of outputs given a set of examples which have the output provided:
  • E: the set of all possible examples < X, Y >
  • X: representation of a given example
    • Input attributes: X = {x1, x2 ... xn}
      • Boolean Attributes: xi in {True, False}
      • Categorical Attributes: xi in a finite set of categories
      • Continuous Attributes: xi in the real numbers
  • Y: representation of the desired output (if available)
    • Output attributes: Y = {y1, y2 ... ym}
      • Boolean Classification: Y = {y1}, with y1 in {True, False}
      • Categorical (multiclass) Classification: Y = {y1}, with y1 in a finite set of categories
      • Unsupervised Learning: Y = {} = the empty set
  • H: the space of possible hypotheses for the function f
    • Decision Trees: tree of if-then tests on inputs with outputs at leaves
    • Neural Networks: networks that map features to output
    • Nearest Neighbor: store previously seen examples and interpolate
    • Mathematical Expressions: compute best formula using regression
  • T: training set: the subset of examples used to train the algorithm
  • D: the distribution from which the training set is drawn
Goal: find a hypothesis that generalizes - predicts correctly for examples not in the training set.
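
To make this framing concrete, here is a minimal Python sketch; the attribute names, the labeling rule, and the threshold hypothesis are invented purely for illustration. It builds a set of examples E, splits off a training set T, and measures how well a simple hypothesis h from H generalizes to the unseen examples.

    import random

    # Each example pairs an input X (a dict of attributes) with an output Y (a boolean class).
    # The attributes and the labeling rule below are invented purely for illustration.
    E = [({"age": a, "smoker": s}, (a > 50) or s)
         for a in range(20, 80, 5) for s in (True, False)]

    random.seed(0)
    random.shuffle(E)
    T = E[:len(E) // 2]            # training set: the examples the learner may see
    unseen = E[len(E) // 2:]       # held-out examples used to estimate generalization

    # One (deliberately crude) hypothesis h from the space H: predict True if age > 40.
    def h(x):
        return x["age"] > 40

    accuracy = sum(h(x) == y for x, y in unseen) / len(unseen)
    print(f"accuracy on unseen examples: {accuracy:.2f}")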

Why we need to prepare the data

  • Missing attribute values - e.g., missing name or age field.
    Difficult to handle correctly:
    • Throw out example?
    • Use a default value?
    • Use the mean of the training set?
    • Draw randomly from the training set?
    • Find the most likely value for the other attributes?
  • Bad attribute values - e.g., incorrect name or age field
    Even more difficult to handle correctly!
    • Example: Zip codes like 30318, 90210, J2S7K7, 6269, 99999
    • Can't just throw out everything that's not a 5-digit number:
      • J2S7K7: postal code for Saint-Hyacinthe, Quebec, Canada
      • 6269: zip code for Storrs, Connecticut (06269)
      • 99999: probably an end-of-field marker, not a zip code
  • Redundant attributes - e.g., cell phone usage and cell phone charges.
    Can skew learning algorithms:
    • Two attributes measuring the same feature of reality
    • Correlated attributes can skew the importance of an association
  • Hidden attributes - e.g., stress level w.r.t. heart disease
    Examples may not have the right data to identify the pattern
    • Hidden attributes are also called lurking variables
    • Variables might not have a high enough resolution
  • Bad Attributes - invalid, spurious, or simply multivalued
    Can enable learning algorithms to find spurious correlations
    • Invalid attributes - e.g., a test field that was never filled in
    • Spurious attributes - e.g., a marker not correlated with the disease
    • Multivalued attributes - e.g., a name field unique to each example
  • Outliers - e.g., Bill Gates's income, Yao Ming's height
    • Individual "bad" examples can skew statistical functions
    • Various methods exist to rescale data to limit the influence of outliers (see the sketch after this list):
      • Min-Max Normalization: scale by the overall range
      • Z-Score Standardization: scale by the standard deviation
      • Interquartile Range: scale by range between 25% and 75% of data
  • Bad training data - distribution of examples can be skewed
    • A skewed series of examples will lead to a skewed hypothesis
    • Focus on wrong or irrelevant data
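
As promised above, a minimal Python sketch of the three rescaling methods; the income values and the 99999 sentinel are invented to echo the "bad value" discussion, and only the standard library is used.

    import statistics

    # Invented income data with one extreme outlier and one 99999-style sentinel value.
    raw = [32000, 41000, 38000, 45000, 52000, 99999, 1000000]
    values = [v for v in raw if v != 99999]    # treat the sentinel as missing, not as a number

    def min_max(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) for x in xs]              # scale by the overall range

    def z_score(xs):
        mu, sd = statistics.mean(xs), statistics.stdev(xs)
        return [(x - mu) / sd for x in xs]                     # scale by the standard deviation

    def iqr_scale(xs):
        q1, q2, q3 = statistics.quantiles(xs, n=4)
        return [(x - q2) / (q3 - q1) for x in xs]              # scale by the 25%-75% range

    # Note how the single outlier dominates the min-max range but not the IQR.
    print(min_max(values))
    print(z_score(values))
    print(iqr_scale(values))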

Why we need data analysis

We want to avoid overfitting: learning hypotheses so powerful that they effectively memorize the training data, noise and all, instead of the underlying pattern (see the sketch after this list).
  • Error: probability that the hypothesis will misclassify an input
  • Underfitting: having too weak a hypothesis space to account for the data
    • E.g., a linear function can't accurately represent a polynomial
    • E.g., a polynomial can't accurately represent a sine wave
  • Overfitting: learning spurious relationships in the data
    • E.g., an nth-order polynomial may wiggle to fit the noise in your data
    • Particularly dangerous for lurking variables and multivalued attributes
  • Ockham's Razor: prefer the simplest hypothesis that accounts for the data
  • PAC Learning: find probably approximately correct hypotheses faster by focusing on smaller spaces of hypotheses
  • Learning Bias: knowledge used to restrict the set of hypotheses. For example, with n boolean attributes there are only 2^n possible examples but 2^(2^n) possible boolean functions over them, so unless you have some knowledge that restricts the set of trees you build, you can do no better than a lookup table. However, restricting the set of hypotheses can exclude exactly what you want to learn!
    • Confirmation Bias: the human tendency to seek evidence that confirms the hypothesis we already have, rather than evidence that contradicts it
    • Peeking: using knowledge of the test data to improve the algorithm
This combination of factors makes problem and data understanding both crucial and hard: we have to restrict our search for hypotheses in order to find anything, but in doing so we can just as easily preclude ourselves from finding out anything new!
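
As a small illustration of underfitting and overfitting, the sketch below fits polynomials of increasing degree to a few noisy samples of a sine wave. The data is synthetic and the degrees are arbitrary choices, but it typically shows the pattern described above: training error keeps falling while error on fresh points eventually rises. Assumes NumPy is available.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0, 1, 12)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)  # noisy samples
    x_test = np.linspace(0, 1, 100)
    y_test = np.sin(2 * np.pi * x_test)                                       # the true signal

    for degree in (1, 3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)      # least-squares polynomial fit
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")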

Case Study: Learning Decision Trees

What are Decision Trees?

A decision tree is a hierarchical set of if-then rules that partitions the space of examples based on functions of the attributes.
  • Decision Trees: partition based on boolean functions of attributes
  • Decision Lists: use a simple list of tests, not a tree
  • Continuous Inputs: use split-point functions rather than boolean tests
  • Continuous Outputs: use regression functions at each leaf of the tree

Decision Tree Learning

Learning decision trees involves finding the best set of questions to ask to completely partition the examples. An algorithm for learning decision trees (a minimal Python sketch follows the list):
  • Given a set of examples, a set of attributes, and a default classification:
    • If there are no examples, return the default classification
    • If all the examples have the same classification, return that classification
    • Select the category that best fits all the examples (e.g., the majority classification)
    • Select the attribute that best splits the examples into categories
    • Create a new tree rooted on the splitting attribute
    • For each possible value of the splitting attribute, create a branch of the tree by calling this algorithm recursively with:
      • Examples: all examples with that value of the attribute
      • Attributes: all attributes except the splitting attribute
      • Default: the current best-fitting category
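
As mentioned above, here is a minimal Python sketch of this recursive procedure, using majority vote as the "best-fitting" classification and leaving the choice of the best splitting attribute as a parameter (an information-gain version of that choice is discussed in the next subsection).

    from collections import Counter

    def majority(examples):
        """The classification that best fits a set of (attributes, class) examples."""
        return Counter(y for _, y in examples).most_common(1)[0][0]

    def learn_tree(examples, attributes, default, choose_best_attribute):
        if not examples:
            return default                            # no examples: return the default
        classes = {y for _, y in examples}
        if len(classes) == 1:
            return classes.pop()                      # all examples agree: return that class
        if not attributes:
            return majority(examples)                 # nothing left to split on
        best = choose_best_attribute(examples, attributes)
        tree = {"split on": best, "branches": {}}     # new tree rooted on the splitting attribute
        for value in {x[best] for x, _ in examples}:
            subset = [(x, y) for x, y in examples if x[best] == value]
            tree["branches"][value] = learn_tree(
                subset,                               # examples with that value of the attribute
                [a for a in attributes if a != best], # all attributes except the splitting one
                majority(examples),                   # default: the current best-fitting class
                choose_best_attribute)
        return tree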

How to select the best attribute

Choose the attribute that provides the most information gain - intuitively, the most bang for the buck in splitting up the remaining examples.
  • Information: how much a piece of data helps us answer questions
  • Bit: the smallest amount of information - the answer to a yes/no question
    • Bits needed to represent n choices: log2(n)
  • Information provided by finding an answer that occurs m chances out of n:
    • Bits needed to represent all n possible choices minus the bits needed to represent the m choices consistent with the answer
      = log2(n) - log2(m)
      = -log2(m/n)
    • But m/n is just the probability p of the answer!
      p = m/n, so the information provided is -log2(p)
    • Furthermore, before we ask we don't know which answer we will get, so we average over all the possible answers
  • SO, the information provided by an answer given a set of choices is
    the sum of the information content of each answer weighted by its probability:
    • I( A | C1 ... CN )
      = I( P(C1) ... P(CN) )
      = ∑ -P(Ci) log2(P(Ci))
  • So the information provided by finding out that a coin flip came up Heads is:
    • I( H | {H,T} )
      = I( P(H), P(T) )
      = -P(H) log2(P(H)) - P(T) log2(P(T))
      = -½ log2(½) - ½ log2(½)
      = ½ + ½
      = 1 bit
  • And the information provided by the roll of a four-sided die:
    • I( 1 | {1 ... 4} )
      = I( P(1) ... P(4) )
      = ∑ -P(i) log2(P(i))
      = ∑ -¼ log2(¼)
      = 4 × (-¼ × -2)
      = 2 bits
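
The formula is easy to check in a couple of lines of Python; the coin and die probabilities reproduce the 1-bit and 2-bit answers worked out above.

    import math

    def information(probabilities):
        """I( P(C1) ... P(CN) ) = sum of -P(Ci) log2(P(Ci))."""
        return sum(-p * math.log2(p) for p in probabilities if p > 0)

    print(information([0.5, 0.5]))                  # fair coin: 1.0 bit
    print(information([0.25, 0.25, 0.25, 0.25]))    # four-sided die: 2.0 bits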
So to apply this to decision tree learning, we need to find the attribute that provides us the most information. For boolean classification, that's the attribute which gives us the best answer to the question "is this a positive example or not".
  • Total positive examples: p
  • Total negative examples: n
  • Total examples: p+n
  • Total information in the training set:
    • I( classification | examples )
      = I( P(positive), P(negative) )
      = -P(positive) log2(P(positive)) - P(negative) log2(P(negative))
      = -p/(p+n) log2(p/(p+n)) - n/(p+n) log2(n/(p+n))
  • Remainder: the information still needed to classify examples after we split on an attribute
    • Sum up the information in each subset of examples with a given value i of the attribute:
    • ∑ P(example has attribute value i) × I( P(positive given value i), P(negative given value i) )
      = ∑ (pi+ni)/(p+n) I( pi/(pi+ni), ni/(pi+ni) )
      = ∑ (pi+ni)/(p+n) [ -pi/(pi+ni) log2(pi/(pi+ni)) - ni/(pi+ni) log2(ni/(pi+ni)) ]
  • Information Gain: the information we get from splitting on the attribute
    • Information Gain = Total Information - Remainder
  • For example, an attribute that completely splits the examples into all-positive and all-negative subsets has a remainder of zero, so its gain is the entire information in the training set; no additional information is needed after that split.
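
A small Python sketch of the gain computation for boolean classification, reusing the information function above; the split counts are invented, and simply confirm that an uninformative split has zero gain while a perfect split recovers all the information in the training set.

    import math

    def information(probabilities):
        return sum(-p * math.log2(p) for p in probabilities if p > 0)

    def gain(p, n, splits):
        """splits is a list of (pi, ni) counts, one pair per value of the attribute."""
        total = information([p / (p + n), n / (p + n)])
        remainder = sum((pi + ni) / (p + n) * information([pi / (pi + ni), ni / (pi + ni)])
                        for pi, ni in splits)
        return total - remainder

    # Invented counts: 6 positive and 6 negative examples overall.
    print(gain(6, 6, [(3, 3), (3, 3)]))   # uninformative split: gain 0.0
    print(gain(6, 6, [(6, 0), (0, 6)]))   # perfect split: remainder 0, gain 1.0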
One way to test the algorithm is to divide the examples into a training set and a test set, and see whether the tree learned on the training set accurately classifies the test set.

Other Machine Learning Algorithms

Instance-Based Learning

Instance-based learning stores a population of examples and estimates values by returning results from one or more stored examples "nearest" to the input.
  • Memory-based reasoning: Store large numbers of examples, use the "nearest" answer
  • k-Nearest-Neighbor: Find the k nearest answers and interpolate (or take the majority)
  • Case-based reasoning: Find the most similar answer and "adapt" it
Instance-based learning mechanisms need a similarity function or distance metric to compute which stored examples are closest to a new test input. If the number of stored examples is very large, a memory retrieval system is required to make access to the nearest examples efficient.
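
A minimal k-nearest-neighbor sketch, using Euclidean distance as the distance metric and a majority vote over the k nearest stored examples; the two-dimensional points are invented.

    import math
    from collections import Counter

    def knn_classify(stored, query, k=3):
        """stored: list of (feature_vector, label) pairs; query: a feature vector."""
        nearest = sorted(stored, key=lambda ex: math.dist(ex[0], query))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    examples = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
                ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
    print(knn_classify(examples, (2, 2)))   # "A"
    print(knn_classify(examples, (8, 7)))   # "B"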

Clustering Algorithms

Clustering algorithms take a population of examples and divide them into groups of similar examples.
  • Hierarchical clustering: Group similar examples together, then group the clusters recursively
  • k-Means Clustering: Pick k examples as initial cluster centers, assign each example to the nearest center, then recompute the centers and repeat (see the sketch below)
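
As mentioned above, a compact k-means sketch on invented two-dimensional points; the initial centers are just the first k points, and the fixed iteration count is an arbitrary simplification (real implementations iterate until the assignments stop changing).

    import math

    def k_means(points, k, iterations=10):
        centers = list(points[:k])                    # start with k of the examples as centers
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:                          # assign each point to its nearest center
                i = min(range(k), key=lambda j: math.dist(p, centers[j]))
                clusters[i].append(p)
            for i, cluster in enumerate(clusters):    # move each center to its cluster's mean
                if cluster:
                    centers[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
        return centers, clusters

    points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
    centers, clusters = k_means(points, 2)
    print(centers)
    print(clusters)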

Neural Networks

Neural networks act as function approximators.
  • Graphs of nodes connected by links with varying weights
    • Feedforward networks are directed acyclic graphs (no reverse links)
    • Recurrent networks allow back links
  • Feedforward neural networks can be broken down into two classes
    • Perceptrons have no hidden layers and can compute only linearly separable functions
    • Hidden layer networks have an intervening layer between input and output and can compute more complex functions
  • Backpropagation enables hidden layer feedforward networks to learn (see the sketch below)
    • Use the weights of the network to combine the inputs into the outputs
    • Compute the error (difference) between the expected and actual output
    • Use the error to adjust the weights, layer by layer, closer to producing the right answer
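
As mentioned above, a minimal backpropagation sketch: a feedforward network with one hidden layer of sigmoid units learning XOR, a function no perceptron can compute because it is not linearly separable. The layer sizes, learning rate, and iteration count are arbitrary illustrative choices; assumes NumPy is available.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    Y = np.array([[0], [1], [1], [0]], dtype=float)       # XOR: not linearly separable

    W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)        # input -> hidden weights and biases
    W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)        # hidden -> output weights and biases
    lr = 1.0

    for _ in range(10000):
        hidden = sigmoid(X @ W1 + b1)                     # combine inputs into outputs
        output = sigmoid(hidden @ W2 + b2)
        error = Y - output                                # expected minus actual output
        d_out = error * output * (1 - output)             # error signal at the output layer
        d_hid = (d_out @ W2.T) * hidden * (1 - hidden)    # error propagated back to hidden layer
        W2 += lr * hidden.T @ d_out
        b2 += lr * d_out.sum(axis=0)
        W1 += lr * X.T @ d_hid
        b1 += lr * d_hid.sum(axis=0)

    print(np.round(output, 2))    # typically converges toward [[0], [1], [1], [0]]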

Self-Organizing Maps

Self-organizing maps use competitive learning.
  • Competition
    • Each attribute of the input vector is fed to every node in the network
    • Each node computes a response to the input based on its weights
    • A scoring function evaluates the responses
    • The node with the best score is the "winning" node
  • Cooperation
    • Nodes which are close to the winning node receive partial activation
    • The winning node is thus surrounded by a neighborhood of active neurons
  • Adaptation
    • Active nodes are adjusted to reduce their error
    • The more active a node is, the more it adjusts its weights
The result is that a self-organizing map divides into a set of regions, each of which maps a particular sub-population of examples. This is useful for categorization, data visualization, and dimensionality reduction.
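
A minimal sketch of one adaptation step for a one-dimensional self-organizing map: the node whose weight vector is closest to the input wins, and the winner and its neighbors move toward the input, with closer neighbors moving more. The grid size, learning rate, neighborhood radius, and the two invented input clusters are arbitrary illustrative choices; assumes NumPy is available.

    import numpy as np

    rng = np.random.default_rng(0)
    grid_size, dims = 10, 2
    weights = rng.random((grid_size, dims))               # one weight vector per map node

    def som_step(x, weights, lr=0.5, radius=2):
        # Competition: the node whose weights best match the input wins.
        winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        # Cooperation and adaptation: the winner's neighborhood moves toward the input.
        for i in range(len(weights)):
            d = abs(i - winner)
            if d <= radius:
                influence = np.exp(-d ** 2 / (2 * radius ** 2))
                weights[i] += lr * influence * (x - weights[i])
        return weights

    # Invented inputs drawn from two clusters; after training, nearby map nodes
    # end up specializing for nearby regions of the input space.
    data = np.vstack([rng.normal(0.2, 0.05, (50, 2)), rng.normal(0.8, 0.05, (50, 2))])
    for x in data:
        weights = som_step(x, weights)
    print(np.round(weights, 2))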

Bayesian Network Learning

Bayesian networks model probabilistic relationships between variables. A greedy Bayesian network learning algorithm (the sketch below illustrates only its first step):
  • Start with an empty network with no associations
  • Find the pair of variables with the strongest correlation
  • Add a link between them to the network
  • Adjust the weights (conditional probabilities) of the network
  • Repeat until all the variance is accounted for
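
The sketch below illustrates only the "find the most correlated pair of variables" step of the greedy procedure above, on invented binary health data; it is not a full structure learner. Assumes NumPy is available.

    from itertools import combinations
    import numpy as np

    rng = np.random.default_rng(0)
    # Invented data: "cough" is largely driven by "smoking"; "exercise" is independent.
    smoking = rng.integers(0, 2, 500)
    cough = np.clip(smoking + (rng.random(500) < 0.2), 0, 1)
    exercise = rng.integers(0, 2, 500)
    data = {"smoking": smoking, "cough": cough, "exercise": exercise}

    def most_correlated_pair(data):
        """The pair of variables that a greedy learner would link first."""
        best, best_r = None, 0.0
        for a, b in combinations(data, 2):
            r = abs(np.corrcoef(data[a], data[b])[0, 1])
            if r > best_r:
                best, best_r = (a, b), r
        return best, best_r

    print(most_correlated_pair(data))    # expected: ("smoking", "cough") with high |r|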

Other Learning Algorithms

  • A Priori Association Rule Learning
    • Find all "frequent itemsets" (e.g., smoking, high-fat diet, no exercise)
      • Use a threshold of minimum frequency (support)
      • The a priori property: if a set of items is not frequent, no superset containing it will be frequent either (sketched after this list)
    • For each frequent itemset, generate the set of statistically probable rules
      • Find all subsets of the itemset
      • Split each subset into IF and THEN parts to form candidate rules
      • Keep all the rules that are statistically probable
  • Version Space Learning
    • Graph search through the space of possible hypotheses
    • Works over conjunctions of features
      • Most specific hypothesis: classifies every example as false
      • Most general hypothesis: classifies every example as true
    • Keep the set of most general hypotheses and the single most specific hypothesis
    • For each example:
      • If it is a positive example, generalize the most specific hypothesis just enough to cover it
      • If it is a negative example, specialize the set of most general hypotheses to exclude it
  • Support Vector Machines
    • Problem: Difficult to learn complicated nonlinear functions
    • Solution: Re-represent the problem in a higher-dimensional space
    • Result: A simple linearly separable learning problem (most of the time)
    • "Support vectors" are the examples on the boundaries that define the separation
  • Genetic Algorithms
    • Problem: We don't have a good handle on the right hypothesis space
    • Solution: Allow the system to evolve the representation
    • Requires: A fitness function that determines the goodness of a result
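
As mentioned above, a small sketch of frequent-itemset mining with the a priori property on invented "transactions" of risk factors: candidate itemsets of size k are built only from frequent itemsets of size k-1, so any set containing an infrequent subset is never counted.

    # Invented transactions of risk factors; min_support is an arbitrary threshold.
    transactions = [
        {"smoking", "high-fat diet", "no exercise"},
        {"smoking", "no exercise"},
        {"high-fat diet", "no exercise"},
        {"smoking", "high-fat diet", "no exercise"},
        {"high-fat diet"},
    ]
    min_support = 3    # minimum number of transactions an itemset must appear in

    def frequent_itemsets(transactions, min_support):
        items = {item for t in transactions for item in t}
        levels = [{frozenset([i]) for i in items
                   if sum(i in t for t in transactions) >= min_support}]
        while levels[-1]:
            # a priori property: build size-k candidates only from frequent (k-1)-itemsets
            candidates = {a | b for a in levels[-1] for b in levels[-1]
                          if len(a | b) == len(a) + 1}
            levels.append({c for c in candidates
                           if sum(c <= t for t in transactions) >= min_support})
        return [set(s) for level in levels for s in level]

    print(frequent_itemsets(transactions, min_support))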

    "Success" Stories in Data Mining

    Businesses do not in general publicize their data mining results --- they use them for competitive advantage. The following examples from healthcare and law enforcment are systems that were tested with real data in the field, but the learned results do not appear to have been applied to practice yet.

Analyzing Cystic Fibrosis

  • Goal: Predict what factors indicate prognosis in cystic fibrosis patients
  • Data Source: 800+ Cystic Fibrosis Patients at U of Colorado
    • General patient information
    • FEV1% - forced expiratory volume in one second % predicted
    • Genotypes
    • Prognosis information
  • Data Preparation:
    • Correcting missing and nonsensical values
    • Removing bad attributes
    • Discretizing continuous variables
  • Learning Mechanism: "Data Squeezer" production rule learner
  • Learned Results:
    • Evaluated by experts from 1 (trivial) to 4 (interesting and novel)
    • Found 9 previously known useful results
    • Only 1 unknown result - sweat electrolytes indicate prognosis

Tracking Offender Networks

  • Goal: Identify crimes that may have been committed by a group of perpetrators
  • Data Source: 48,000+ burglaries in the West Midlands Police Department
    • Crime date and location
    • Modus operandi checklist
    • Free text case narratives
  • Data Preparation:
    • Extracting modus operandi information from free text
    • Omitting records with key missing data
    • Temporal and spatial breakdowns
  • Learning Mechanism: Self-Organizing Map
  • Learned Results:
    • Could prepare lists for interview in 5 minutes (as opposed to 1-2 hours)
    • Thresholded lists only contained relevant crimes (as opposed to 5-10% accurate)

Resources

  • Data Mining: ????
  • Machine Learning: ????
  • Neural Networks: ????
  • Decision Trees: ????
  • Clustering: ????
  • Self-Organizing Maps: ????
  • Nearest Neighbor: ????
  • Medical Data Mining: ????
  • Law Enforcement Data Mining: ????