CODEPUSH-4: Add data generator #4

Open
bharathappali wants to merge 5 commits into kruize:main from bharathappali:add-data-gen

@bharathappali
Member

This PR is built on #3.

#3 needs to be merged before this PR.

This PR adds the logic that generates data from the previously generated config. The results are written to a Parquet file for better compression and lower disk usage on larger data sets.

Converters need to be added on top of this PR to convert a portion or the complete set of the data stored in Parquet.
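
As a rough illustration of the kind of generation step described above, the sketch below builds rows with the same columns that data-gen.py prints in its schema later in this thread (timestamp, value, metric_name, and the label columns). The helper name `generate_rows`, the metric name, the label values, and the 15-second step are all hypothetical and not taken from the PR. The actual script uses PySpark and a config file; writing the rows to Parquet (for example through Spark's `DataFrame.write.parquet`) is deliberately left out here.

```python
import random
from datetime import datetime, timedelta, timezone

# Label column names taken from the schema printed later in this thread.
LABEL_COLUMNS = ["container", "endpoint", "id", "image", "job",
                 "namespace", "node", "pod", "service"]


def generate_rows(metric_name, labels, start, count, step_seconds=15, seed=None):
    """Return `count` rows for one metric, spaced `step_seconds` apart.

    Hypothetical helper: the real data-gen.py may structure this differently.
    """
    rng = random.Random(seed)  # local RNG so a seed gives repeatable output
    rows = []
    for i in range(count):
        ts = start + timedelta(seconds=i * step_seconds)
        row = {
            "timestamp": int(ts.timestamp()),  # IntegerType in the schema
            "value": rng.uniform(0.0, 1.0),    # DoubleType in the schema
            "metric_name": metric_name,
        }
        # Missing labels become None, matching the nullable string columns.
        for col in LABEL_COLUMNS:
            row[col] = labels.get(col)
        rows.append(row)
    return rows


# Example values below are made up for illustration only.
start = datetime(2026, 1, 1, tzinfo=timezone.utc)
rows = generate_rows(
    "container_cpu_usage_seconds_total",
    {"container": "app", "namespace": "default", "pod": "app-0"},
    start,
    count=4,
    seed=42,
)
print(len(rows), rows[0]["timestamp"], rows[-1]["timestamp"])
```

Keeping the generator free of Spark calls makes it easy to check on its own; a list of dicts like this can later be handed to whatever writer the converters end up using.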

Signed-off-by: bharathappali <abharath@redhat.com>
Comment thread requirements.txt
@@ -0,0 +1 @@
pyspark~=3.5.1
\ No newline at end of file
Contributor

@bharathappali I don't see this file being used. Can you update the scripts to install this requirement?

Comment thread data-gen.py
@@ -0,0 +1,303 @@
import argparse
Contributor

Include a copyright header in all the files.

@chandrams
Contributor

@bharathappali - I get the error below when I run data-gen.py. Am I missing anything?

python data-gen.py --config-name config
WARNING: Using incubator modules: jdk.incubator.vector
WARNING: package sun.security.action not in java.base
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
26/02/13 09:54:19 WARN Utils: Your hostname, li-a82bab4c-21dc-11b2-a85c-ad8b4a97c40c.ibm.com, resolves to a loopback address: 127.0.0.1; using 192.168.0.163 instead (on interface wlp9s0)
26/02/13 09:54:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/02/13 09:54:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARNING: A terminally deprecated method in sun.misc.Unsafe has been called
WARNING: sun.misc.Unsafe::arrayBaseOffset has been called by org.apache.spark.unsafe.Platform (file:/home/chandrams/.local/lib/python3.14/site-packages/pyspark/jars/spark-unsafe_2.13-4.1.1.jar)
WARNING: Please consider reporting this to the maintainers of class org.apache.spark.unsafe.Platform
WARNING: sun.misc.Unsafe::arrayBaseOffset will be removed in a future release
[StructField('timestamp', IntegerType(), True), StructField('value', DoubleType(), True), StructField('metric_name', StringType(), True), StructField('container', StringType(), True), StructField('endpoint', StringType(), True), StructField('id', StringType(), True), StructField('image', StringType(), True), StructField('job', StringType(), True), StructField('namespace', StringType(), True), StructField('node', StringType(), True), StructField('pod', StringType(), True), StructField('service', StringType(), True)]
26/02/13 09:54:25 WARN FileSystem: Cannot load filesystem
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.viewfs.ViewFileSystem could not be instantiated
	at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:552)
	at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:712)
	at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:672)
	at java.base/java.util.ServiceLoader$2.next(ServiceLoader.java:1256)
	at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:3525)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3562)
	at org.apache.hadoop.fs.FsUrlStreamHandlerFactory.<init>(FsUrlStreamHandlerFactory.java:77)
	at org.apache.spark.sql.internal.SharedState$.liftedTree2$1(SharedState.scala:209)
	at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$setFsUrlStreamHandlerFactory(SharedState.scala:208)
	at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:56)
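
One thing that may be worth checking, although the log alone does not establish it as the cause: requirements.txt pins `pyspark~=3.5.1`, while the jar path in the log (`spark-unsafe_2.13-4.1.1.jar`) suggests a 4.1.1 install. The sketch below is a hand-written approximation of the `~=` compatible-release rule for plain numeric versions. The helper names are hypothetical, and a real script would more likely rely on `pip` or the `packaging` library.

```python
import re
from importlib import metadata


def parse_version(text):
    """Parse the leading dotted-numeric part, e.g. '3.5.1' -> (3, 5, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"not a numeric version: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def satisfies_compatible_release(installed, pin):
    """Approximate 'installed ~= pin' for plain numeric versions.

    '~=3.5.1' means '>=3.5.1' and '==3.5.*': all but the last
    component of the pin must match, and installed must not be older.
    """
    have, want = parse_version(installed), parse_version(pin)
    if len(want) < 2:
        raise ValueError("'~=' needs at least two version components")
    prefix = want[:-1]
    return have[:len(prefix)] == prefix and have >= want


def check_installed(package, pin):
    """Return (installed_version, ok), or (None, False) if not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None, False
    return installed, satisfies_compatible_release(installed, pin)


# 4.1.1 is the version suggested by the jar name in the log above.
print(satisfies_compatible_release("4.1.1", "3.5.1"))
print(check_installed("pyspark", "3.5.1"))
```

With the 4.1.1 version from the log, the first check reports that the pin is not satisfied, which is consistent with the earlier review comment that nothing currently installs requirements.txt.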

Comment thread data-gen.py
import random
from datetime import datetime, timedelta

from consts.constants import Constants

Status: Under Review
