Repository for my Master Thesis at Politecnico di Milano

"A comparative study on parallelization of streaming top-k algorithms in Kafka"

How to try it!

Run docker-compose up -d to build and run the images

Run the application using its jar file in out/artifacts/APP_NAME/

Run src/main/java/myapp/RatingsDriverTest.java, change the INPUT_TOPIC in the file based on the app you're testing.

Measurements

Centralized Benchmark

Run RatingsDriver with msstopk-scored-rated-movies-${DATASET}, INPUT_THROUGHPUT and DATASET as arguments.

Then: Run src/main/java/myapp/materializeScoreSort/PhysicalWindow/CentralizedMSSTopK.java with TOPK and DATASET as arguments

Check when reading offset << writing offset

Run cleanLatencyFile.py CentralizedMSSTopK/dataset/topKK_latency_5s.txt TOPK to get the final MaterializeSort/latency_5ms.csv

Run averageLatency.py CentralizedMSSTopK/dataset/topKK_latency_5s.txt to get average Latency of the experiment.

Centralized MinTopK

Run RatingsDriver with centralized-mintopk-scored-rated-movies-${DATASET}, INPUT_THROUGHPUT and DATASET as arguments.

Then:

Run the experiment using the 5 datasets that you can find in dataset/

Run java -jar out/artifacts/minTopK_jar/kafka-stream-tutorial.jar "src/main/java/myapp/minTopK/minTopK.env" topK dataset where topK and dataset are integer values.

Run cleanLatencyFile.py CentralizedMinTopK/topKK_latency_5ms.txt topK to get the final CentralizedMinTopK/topKK_latency_5ms.csv

Run averageLatency.py CentralizedMinTopK/topKK_latency_5ms.txt to get average Latency of the experiment.

Centralized MinTopK+N

Run the experiment using the 5 datasets that you can find in dataset/

Run java -jar out/artifacts/CentralizeMinTopKN_jar/CentralizeMinTopKN.jar "src/main/java/myapp/minTopKN/minTopKN.env" topK topN dataset where topK, topN and dataset are integer values.

Run RatingsDriver with centralized-mintopkn-scored-rated-movies, INPUT_THROUGHPUT and DATASET as arguments.

Run cleanLatencyFile.py CentralizedMinTopKN/topKK_latency_5ms.txt topK to get the final CentralizedMinTopKN/topKK_latency_5ms.csv

Run averageLatency.py CentralizedMinTopKN/topKK_latency_5ms.txt to get average Latency of the experiment.

Parallel Benchmark

Load input data running RatingsDriver with pdmss-scored-rated-movies-dataset${DATASET}, INPUT_THROUGHPUT and DATASET as arguments.

Then:

Run 3 instances of src/main/java/myapp/distributedMaterializeScoreSort/PhysicalWindow/PhysicalWindowDistributedMSS ENV_FILE TOPK DATASET #INSTANCE.

Run src/main/java/myapp/distributedMaterializeScoreSort/PhysicalWindow/PhysicalWindowCentralizedAggregatedSort ENV_FILE TOPK DATASET.

Run both files with src/main/java/myapp/distributedMaterializeScoreSort/PhysicalWindow/physicalWindowDisMSS.env as ENV_FILE argument.

Parallel MinTopK

Run RatingsDriver with dis-mintopk-scored-rated-movies-dataset${DATASET}, INPUT_THROUGHPUT and DATASET as arguments.

Then:

Run 3 instances of src/main/java/myapp/distributedMinTopK/DistributedMinTopK.java ENV_FILE TOPK DATASET #INSTANCE.

Run src/main/java/myapp/distributedMinTopK/CentralizedTopK.java ENV_FILE TOPK DATASET DisMinTopK.

Run both files with src/main/java/myapp/distributedMinTopK/disMinTopK.env as first argument.

WIP: Parallel MinTopK+N

Run 3 instances of src/main/java/myapp/distributedMinTopKN/DistributedMinTopKN.java ENV_FILE TOPK TOPN DATASET #INSTANCE.

Run src/main/java/myapp/distributedMinTopK/CentralizedTopK.java ENV_FILE TOPK DATASET DisMinTopKN.

Run both files with src/main/java/myapp/distributedMinTopKN/disMinTopKN.env as first argument.

Run RatingsDriver with dis-mintopkn-scored-rated-movies, INPUT_THROUGHPUT and DATASET as arguments.

Parallel measurements

Run ./measurements.sh NUM_INSTANCES DATASET ALGO TOPK to clean measurent files and compute distributed latency and total_time ALGO parameter can be one betweem DisMSSTopK and DisMinTopK
(For NUM_INSTANCES != 6 need to modify python script distributedLatency.py)

Average latency

Run python3 averageLatency.py CentralizedMinTopK/dataset0/500Krecords_1200_300_50K_latency_5s.csv CentralizedMinTopK/dataset1/500Krecords_1200_300_50K_latency_5s.csv CentralizedMinTopK/dataset2/500Krecords_1200_300_50K_latency_5s.csv CentralizedMinTopK/dataset3/500Krecords_1200_300_50K_latency_5s.csv CentralizedMinTopK/dataset4/500Krecords_1200_300_50K_latency_5s.csv

Mean and std

Mean and std per dataset

Run python3 measurementsPerDataset.py CentralizedMinTopK/dataset0/500Krecords_1200_300_50K_latency_5s.csv CentralizedMinTopK/dataset1/500Krecords_1200_300_50K_latency_5s.csv CentralizedMinTopK/dataset2/500Krecords_1200_300_50K_latency_5s.csv CentralizedMinTopK/dataset3/500Krecords_1200_300_50K_latency_5s.csv CentralizedMinTopK/dataset4/500Krecords_1200_300_50K_latency_5s.csv

Mean and std per topK

Run python3 measurements.py CentralizedMinTopK/dataset0/500Krecords_1200_300_2K_average.csv CentralizedMinTopK/dataset0/500Krecords_1200_300_10K_average.csv CentralizedMinTopK/dataset0/500Krecords_1200_300_50K_average.csv

Plot box plot

Latency Box plot per dataset with same topK

Run python3 latencyBoxPlotPerDataset.py CentralizedMinTopK/dataset0/500Krecords_1200_300_2K_latency_5s.csv CentralizedMinTopK/dataset1/500Krecords_1200_300_2K_latency_5s.csv CentralizedMinTopK/dataset2/500Krecords_1200_300_2K_latency_5s.csv CentralizedMinTopK/dataset3/500Krecords_1200_300_2K_latency_5s.csv CentralizedMinTopK/dataset4/500Krecords_1200_300_2K_latency_5s.csv TITLE

Latency Box plot per topK

Run python3 plotBoxPlot.py CentralizedMinTopK/dataset0/500Krecords_1200_300_2K_average.csv CentralizedMinTopK/dataset0/500Krecords_1200_300_10K_average.csv CentralizedMinTopK/dataset0/500Krecords_1200_300_50K_average.csv TITLE

Total Time Box plot per Algorithm

Run python3 totalTimeBoxPlotPerAlgo with the total_times_6instances.csv of each algorithms

Total Time Box plot per Number of Instances

Run python3 totalTimeBoxPlotPerInstances total_times_3instances.csv total_times_6instances.csv total_times_10instances.csv

Total Time Box plot per Top-K

Run python3 totalTimeBoxPlotPerTopK with the total_times_6instances.csv for each topk.

Name		Name	Last commit message	Last commit date
Latest commit History 402 Commits
.github/workflows		.github/workflows
dataset		dataset
jars		jars
measurements		measurements
pythonScripts		pythonScripts
src/main		src/main
.gitignore		.gitignore
Master_Thesis.pdf		Master_Thesis.pdf
README.md		README.md
docker-compose.yml		docker-compose.yml
init.sql		init.sql
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Repository for my Master Thesis at Politecnico di Milano

"A comparative study on parallelization of streaming top-k algorithms in Kafka"

How to try it!

Measurements

Centralized Benchmark

Centralized MinTopK

Centralized MinTopK+N

Parallel Benchmark

Parallel MinTopK

WIP: Parallel MinTopK+N

Parallel measurements

Average latency

Mean and std

Mean and std per dataset

Mean and std per topK

Plot box plot

Latency Box plot per dataset with same topK

Latency Box plot per topK

Total Time Box plot per Algorithm

Total Time Box plot per Number of Instances

Total Time Box plot per Top-K

About

Uh oh!

Releases 7

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Repository for my Master Thesis at Politecnico di Milano

"A comparative study on parallelization of streaming top-k algorithms in Kafka"

How to try it!

Measurements

Centralized Benchmark

Centralized MinTopK

Centralized MinTopK+N

Parallel Benchmark

Parallel MinTopK

WIP: Parallel MinTopK+N

Parallel measurements

Average latency

Mean and std

Mean and std per dataset

Mean and std per topK

Plot box plot

Latency Box plot per dataset with same topK

Latency Box plot per topK

Total Time Box plot per Algorithm

Total Time Box plot per Number of Instances

Total Time Box plot per Top-K

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 7

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages