Commit 7a371865 authored by Jon Harington

Add README.md

Implementation of the PkNN method for distributed kNN search from the paper
"Pivot-based Distributed K-Nearest Neighbor Mining" by Caitlin Kuhlman, Yizhou Yan, Lei Cao, and Elke Rundensteiner, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD),
Research Track, Springer LNCS, 2017.
PkNN search is conducted over four MapReduce jobs.
These can be run using the provided pre-compiled jar pknn.jar, the configuration file conf/test.conf, and the script run_all.sh.
You must pass the run_all.sh script a parameter 'm', which sets the maximum number of data points to assign to a single machine.
The right value depends on your cluster configuration. For the provided data files, we suggest m = 150000 as a default.
usage: ./run_all.sh 150000
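When choosing m, it can help to check how many machines (partitions) a dataset needs at minimum. The sketch below is not part of the PkNN code; the function name min_partitions and the point count of 1,000,000 are hypothetical, and the partitioning PkNN actually performs may use more partitions than this lower bound.

```python
def min_partitions(num_points: int, m: int) -> int:
    """Lower bound on the number of partitions when each holds at most m points.

    Hypothetical helper for choosing m; not part of pknn.jar or run_all.sh.
    """
    if num_points < 0:
        raise ValueError("num_points must be non-negative")
    if m <= 0:
        raise ValueError("m must be a positive integer")
    # Ceiling division using integers only, to avoid floating-point rounding.
    return -(-num_points // m)


if __name__ == "__main__":
    # Assumed dataset size of 1,000,000 points with the suggested m = 150000.
    print(min_partitions(1_000_000, 150_000))  # prints 7
```

If the result exceeds the number of machines available, a larger m is needed, provided each machine has enough memory to hold that many points.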
Test data from the OpenStreetMap dataset is provided in the data folder. Specify the path to your data as the dataset.input.dir property in the config file.
Experiments in the paper are run on public data available at the following sites:
The OpenStreetMap extracts were downloaded from https://download.geofabrik.de/
The TIGER dataset is available from the US Census Bureau and can be downloaded at https://www.census.gov/geo/maps-data/data/tiger.html
The SDSS dataset is available for download at http://stacks.iop.org/1538-3881/151/i=2/a=44