Implementation of the PkNN method for distributed kNN search from the paper

"Pivot-based Distributed K-Nearest Neighbor Mining" by Caitlin Kuhlman, Yizhou Yan, Lei Cao, and Elke Rundensteiner. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), Research Track, Springer LNCS, 2017.

PkNN search is conducted over 4 map-reduce jobs.

This can be run using the provided pre-compiled jar pknn.jar, the configuration file conf/test.conf, and the run_all script.

You must pass the run_all script a parameter 'm', the maximum number of data points to assign to a single machine.
The right value depends on your configuration. For the provided data files, we suggest m = 150000 as a default.
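
As a rough aid for choosing m, the sketch below computes the smallest number of machines that could hold a dataset if no machine receives more than m points. This is plain arithmetic, not part of the PkNN code: the function name min_partitions and the dataset size of 1,000,000 points are hypothetical, and how PkNN actually splits the data is not described here.

```python
def min_partitions(num_points: int, m: int) -> int:
    """Lower bound on the number of machines needed so that
    no machine is assigned more than m data points.

    Hypothetical helper for illustration; not part of pknn.jar.
    """
    if m <= 0:
        raise ValueError("m must be a positive integer")
    if num_points < 0:
        raise ValueError("num_points must be non-negative")
    # Integer ceiling division avoids floating-point rounding.
    return (num_points + m - 1) // m


if __name__ == "__main__":
    # Assumed dataset size of 1,000,000 points with the suggested m = 150000.
    print(min_partitions(1_000_000, 150_000))  # prints 7
```

If the result is larger than the number of machines available, a larger m is needed, provided each machine has the memory to hold that many points.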

usage: ./ 150000

Test data from the OpenStreetMap dataset is provided in the data folder. Specify the path to your data as the dataset.input.dir property in the config file.

Experiments in the paper are run on public data available at the following sites:

OpenStreetMap public data. Extractions were downloaded from 

The TIGER dataset is available from the US Census Bureau and can be downloaded at

The SDSS dataset is available for download at