KNN-GiST indexes were added in PostgreSQL 9.1 and greatly accelerate some common queries in the geospatial and textual search realms. This presentation will demonstrate the power of KNN-GiST indexes on geospatial and text searching queries, but also their present limitations through some of my experimentations. I will also discuss some of the theory behind KNN (k-nearest neighbor) as well as some of the applications this feature can be applied too.
To see a version of the talk given at PostgresOpen 2011, please visit http://www.youtube.com/watch?v=N-MD08QqGEM
Computer 10: Lesson 10 - Online Crimes and Hazards
Accelerating Local Search with PostgreSQL (KNN-Search)
1. Accelera'ng
Local
Search
With
PostgreSQL
9.1
Jonathan
S.
Katz
Exco
Ventures
PGConf.eu
2011
–
Oct
21,
2011
2. Local
Search?
• Not
necessarily
loca'on
based
on
places
• “How
close
are
two
en''es
to
one
another?”
• “What
are
the
closest
en''es
to
me?”
• “What
are
my
nearest
neighbors?”
2
3. Nearest
Neighbor
Overview
• Want
to
know
“how
similar”
objects
are
rela've
to
each
other
– What
are
the
top
“k”
choices
near
me?
• Need
to
define
a
“metric”
for
similarity
– “distance”
3
4. K-‐Nearest
Neighbor
• Given
a
collec'on
of
n
objects
• When
trying
to
classify
an
unknown
object
– compute
the
distance
between
all
known
objects
– find
the
k
(k
≥
1)
closest
objects
to
the
unknown
object
– classify
the
object
based
on
class
of
k
closest
objects
• When
k=1,
then
unknown
object
is
given
same
classifica'on
as
object
it
is
closest
to
4
6. Just
a
bit
more
theory…
• Voronoi
diagrams
of
order-‐1
are
created
in
O(n
log
n)
'me
– Looks
similar
to…?
:-‐)
• Queried
in
O(1)
'me
• Therefore:
– Pay
the
'me
penalty
to
build
the
index
– Query
against
index
quickly
6
8. So
what
about
PostgreSQL?
• As
of
PostgreSQL
9.0
– supports
geometric
types
and
distances
• Points,
circles,
lines,
boxes,
polygons
• Distance
operator:
<-‐>
– pg_trgm
–
supplied
module
for
determining
text
similarity
• similarity(“abc”,
“ade”)
computes
similarity
score
• <-‐>
defines
distance
(opposite
if
similarity),
not
defined
(in
9.0)
8
9. PostgreSQL
9.1:
KNN-‐GiST
• Let
n
=
size
of
a
table
• Can
index
data
that
provides
a
“<-‐>”
(distance)
operator
– Geometric
– pg_trgm
• “k”
=
LIMIT
clause
• Known
inefficiencies
when
k=n
and
n
is
small
9
10. Example:
pg_trgm
• Data:
– List
of
1,000,000
names
–
700,000
unique
– n
=
1,000,000
• Indexes:
– CREATE
INDEX
names_name_idx
ON
names
(name)
– CREATE
INDEX
trgm_idx
ON
names
USING
gist
(name
gist_trgm_ops)
• k=10
• Displaying
query
plan
/
execu'on
'me
aqer
10
runs
10
11. pg_trgm:
9.0
EXPLAIN
ANALYZE
SELECT
name,
similarity(name,
'jon')
AS
sim
FROM
names
WHERE
name
%
'jon'
ORDER
BY
sim
DESC
LIMIT
10;
11
15. Comparable?
• Seems
to
be
similar
– Need
to
do
more
research
why
–
anyone?
• However,
9.1
offers
improvements
for
LIKE/
ILIKE
search
with
pg_trgm
15
22. Drawbacks
• Performance
benefits
are
limited
when:
– LIMIT
is
close
to
size
of
data
set
and
data
set
is
large
– Data
set
is
small
• Time
to
build
index
– High
transac'on
table
22
23. Conclusions
• GiST:
“Generalized
Search
Tree”
–
index
is
there,
up
to
developers
to
define
access
methods
of
data
types
– e.g.
yields
KNN-‐GiST
• Different
types
of
applica'ons
can
be
built
–
performance
enhancements
• Next
steps?
23
24. My
Wish
List
• Further
geometric-‐type
support
in
Postgres
– N-‐dimensional
points
– ‘=‘
operator
for
point
type
– (PostGIS
s'll
champion
of
complex
geometric
+
geographic
data
types)
• Define
“distance”
over
mul'columns
with
different
types?
– SELECT
(a.name,
a.geocode)
<-‐>
(b.name,
b.geocode)
FROM
x
a,
x
b;
24
25. References
• Oleg
Bartunov
–
“Efficient
K-‐nearest
neighbour
search
in
PostgreSQL”
(h~p://www.sai.msu.su/~megera/
postgres/talks/pgday-‐2010.pdf)
• Oleg
Bartunov
and
Teodor
Sigaev
for
work
on
KNN-‐
GiST
and
notes
on
pg_trgm
(h~p://
developer.postgresql.org/pgdocs/postgres/
pgtrgm.html)
• Hubert
“depesz”
Lubaczewski
–
pa~erned
benchmarks
off
of
his
work
-‐
h~p://www.depesz.com/index.php/
2010/12/11/wai'ng-‐for-‐9-‐1-‐knngist/
25