Instance-Based Classifiers:
Rote-Learner:
Memorizes the entire training data and performs classification only if the attributes of a record exactly match one of the training examples.
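As a rough illustration, a rote learner can be sketched as an exact-match lookup table. The class and names below are hypothetical, not from the slides; this is a minimal sketch, assuming hashable attribute tuples.

```python
# Minimal rote-learner sketch: memorize training records, classify only
# on an exact attribute match, otherwise abstain (return None).
class RoteLearner:
    def fit(self, X, y):
        # Memorize the entire training set as an exact-match lookup table.
        self.table = {tuple(record): label for record, label in zip(X, y)}
        return self

    def predict_one(self, record):
        # Classification succeeds only if the record matches a training example.
        return self.table.get(tuple(record))

# Usage:
clf = RoteLearner().fit([[1.5, 90], [1.8, 300]], ["duck", "not-duck"])
print(clf.predict_one([1.5, 90]))   # "duck"
print(clf.predict_one([1.6, 120]))  # None: no exact match
```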
Nearest neighbor:
- Uses the k “closest” points (nearest neighbors) to perform classification
Nearest Neighbor Classifiers:
Basic idea:
If it walks like a duck and quacks like a duck, then it’s probably a duck.
Requires three things:
- The set of labeled records
- A distance metric to compute the distance between records
- The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
- Compute the distance to the other training records
- Identify the k nearest neighbors
- Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
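These three steps translate directly into code. The following is a minimal sketch, assuming NumPy, Euclidean distance, and a hypothetical function name knn_classify; it is not an optimized implementation.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Classify record x by majority vote among its k nearest training records."""
    # Step 1: compute the distance from x to every training record.
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Step 2: identify the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # Step 3: take a majority vote over the neighbors' class labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Usage:
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(np.array([1.1, 0.9]), X_train, y_train, k=3))  # "A"
```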
Voronoi Diagram: the 1-NN classifier partitions the space into Voronoi cells, one per training record; the decision boundary follows the cell boundaries between records of different classes.
Compute the distance between two points:
Euclidean distance: d(p, q) = sqrt( sum_i (p_i - q_i)^2 ), the square root of the sum of squared differences along each attribute.
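For example, the Euclidean distance between (0, 0) and (3, 4) is sqrt(3^2 + 4^2) = sqrt(25) = 5.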
Determine the class from the nearest-neighbor list:
- Take the majority vote of the class labels among the k nearest neighbors
- Or weigh each vote according to distance, using weight factor w = 1/d^2
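A sketch of the distance-weighted variant, assuming NumPy; the function name and the eps guard are illustrative choices, not from the slides.

```python
import numpy as np

def knn_classify_weighted(x, X_train, y_train, k=3, eps=1e-12):
    """Distance-weighted k-NN: each neighbor votes with weight w = 1/d^2."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    scores = {}
    for i in nearest:
        # eps guards against division by zero when x coincides with a record.
        w = 1.0 / (dists[i] ** 2 + eps)
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + w
    # Predict the class with the largest total weighted vote.
    return max(scores, key=scores.get)
```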
Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points
- If k is too large, the neighborhood may include points from other classes
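In practice, k is often chosen by cross-validation. The sketch below assumes scikit-learn is available; the dataset and the candidate values of k are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 7, 11, 21):
    # 5-fold cross-validated accuracy for each candidate k.
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k}: mean accuracy={acc:.3f}")
```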
Scaling issues:
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
– Example:
  – height of a person may vary from 1.5 m to 1.8 m
  – weight of a person may vary from 90 lb to 300 lb
  – income of a person may vary from $10K to $1M
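A minimal standardization sketch, assuming NumPy; z-score scaling is one common choice (min-max scaling is another), and the function name is illustrative.

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score each attribute using statistics of the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant attributes
    return (X_train - mu) / sigma, (X_test - mu) / sigma

# Usage: height (m), weight (lb), income ($) live on very different scales,
# so income would otherwise dominate the Euclidean distance.
X_train = np.array([[1.5, 90.0, 10_000.0], [1.8, 300.0, 1_000_000.0]])
X_test = np.array([[1.7, 150.0, 50_000.0]])
X_train_s, X_test_s = standardize(X_train, X_test)
```

Scaling with training-set statistics only (rather than refitting on test data) keeps the test records on the same footing as the records they are compared against.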
Selection of the right similarity measure is critical:
Nearest Neighbor Classification:
– k-NN classifiers are lazy learners since they do not build models explicitly
– Classifying unknown records is relatively expensive
– Can produce arbitrarily shaped decision boundaries
– Easy to handle variable interactions since the decisions are based on local information
– Selection of the right proximity measure is essential
– Superfluous or redundant attributes can create problems
– Missing attributes are hard to handle
Improving KNN Efficiency:
Avoid having to compute the distance to all objects in the training set:
– Multi-dimensional access methods (e.g., k-d trees)
– Fast approximate similarity search
– Locality-Sensitive Hashing (LSH)
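As one concrete option for the first bullet, SciPy provides a k-d tree. The sketch below assumes SciPy is available and uses its cKDTree; the data is randomly generated for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((10_000, 3))  # 10k training records, 3 attributes

tree = cKDTree(X_train)            # build the index once
query = rng.random(3)
# Retrieve the 5 nearest neighbors without scanning all 10k records.
dists, idx = tree.query(query, k=5)
```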
Condensing:
Determine a smaller set of objects that give the same performance
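A classic instance of condensing is Hart's condensed nearest neighbor rule; the sketch below is a hypothetical NumPy implementation of that idea, not taken from the slides.

```python
import numpy as np

def condense(X, y):
    """Hart's condensed NN sketch: keep only records the 1-NN rule needs."""
    keep = [0]  # seed the condensed set with the first record
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            # Classify record i with 1-NN over the current condensed set.
            d = np.sqrt(((X[keep] - X[i]) ** 2).sum(axis=1))
            if y[keep[int(np.argmin(d))]] != y[i]:
                keep.append(i)  # misclassified: it carries information, keep it
                changed = True
    return X[keep], y[keep]
```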
Editing:
Remove objects to improve efficiency
KNN and Proximity Graphs:
Proximity graph: a graph in which two vertices are connected by an edge if and only if the vertices satisfy particular geometric requirements.
Examples:
– nearest-neighbor graphs
– minimum spanning trees
– Delaunay triangulations
– relative neighborhood graphs
– Gabriel graphs
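For the first example, a k-nearest-neighbor graph can be built directly with scikit-learn's kneighbors_graph; the sketch below assumes scikit-learn is available, with illustrative data.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
# Sparse adjacency matrix: an edge from each point to its 2 nearest neighbors.
A = kneighbors_graph(X, n_neighbors=2, mode="connectivity")
print(A.toarray())
```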