Error Running Spark Job from Django API using subprocess.Popen
I have created a Django project executable, and I need to run a Spark job from an API endpoint within this executable. I am using subprocess.Popen to execute the spark-submit command, but I am encountering an error when the command is executed.
Here’s the command I am trying to run:
/opt/spark-3.5.5-bin-hadoop3/bin/spark-submit --master local --deploy-mode client --conf "spark.ui.enabled=false" --conf "spark.ui.showConsoleProgress=false" --conf "spark.dynamicAllocation.enabled=false" --conf "spark.rdd.compress=false" --conf "spark.driver.memory=4g" --conf "spark.executor.memory=8g" --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" /Users/user1/Project/process-engine/route.py > /app/server/EAP/rasLite/engine/raslight/spark_submit_log/20250424/772_1.0_174702842893.log 2>&1 "{'processNo': '772', 'versionNo': '1.0', 'jsonData': '', 'executionDate': '', 'skipError': 'N', 'generated_executionid': '149897', 'isExecutionIdGenerated': 'True', 'executionId': '149897', 'isPreProcess': 'False'}" &
However, I am getting the following error in the logs:
Unknown command: 'C:/Users/user1/Project/process-engine/route.py'
Type 'ras.exe help' for usage.
Context:
- I am running this command from a Django API endpoint within a Django project executable.
- The path to route.py seems to be correct, but the error message indicates a Windows-style path (C:/Users/...) instead of the Unix-style path I am using (/Users/...).
- I am using the following code to execute the command:
command = f'/Users/user1/Project/process-engine/spark-3.5.5-bin-hadoop3/bin/spark-submit --master local --deploy-mode client --conf "spark.ui.enabled=false" --conf "spark.ui.showConsoleProgress=false" --conf "spark.dynamicAllocation.enabled=false" --conf "spark.rdd.compress=false" --conf "spark.driver.memory=4g" --conf "spark.executor.memory=8g" --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" /Users/user1/Project/process-engine/route.py > {finalFilePath} 2>&1'
final_command = command + " " + '\"' + raw_body_decoded1 + '\"' + " " + "&"
# Set the environment variables for the subprocess
env = os.environ.copy()
env['DJANGO_SETTINGS_MODULE'] = 'rasLightEngine.settings'
env['SPARK_HOME'] = 'C:/Users/user1/Project/process-engine/spark-3.5.5-bin-hadoop3'
subprocess.Popen(f'{final_command}', shell=True, env=env)
I have tried setting the SPARK_HOME environment variable, but it still shows the same error.
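The message "Unknown command: ... Type 'ras.exe help' for usage." is the error that Django's management command runner prints. So route.py appears to be reaching your Django executable (ras.exe) as a command name, rather than being run by Python under spark-submit. Your command string also mixes Unix-style paths with a Windows-style SPARK_HOME. Try the following: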
import os
import subprocess

# 1. Make sure paths are consistent with your OS
# For Windows:
spark_home = 'C:/Users/user1/Project/process-engine/spark-3.5.5-bin-hadoop3'
route_path = 'C:/Users/user1/Project/process-engine/route.py'
# For Unix/Linux/Mac:
# spark_home = '/Users/user1/Project/process-engine/spark-3.5.5-bin-hadoop3'
# route_path = '/Users/user1/Project/process-engine/route.py'
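# 2. Use os.path.join for paths to ensure compatibility
# (on Windows the launcher is spark-submit.cmd; plain spark-submit is a bash script)
spark_submit = os.path.join(spark_home, 'bin', 'spark-submit.cmd' if os.name == 'nt' else 'spark-submit')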
# 3. Build the command with proper quoting
command = [
    spark_submit,
    "--master", "local",
    "--deploy-mode", "client",
    "--conf", "spark.ui.enabled=false",
    "--conf", "spark.ui.showConsoleProgress=false",
    "--conf", "spark.dynamicAllocation.enabled=false",
    "--conf", "spark.rdd.compress=false",
    "--conf", "spark.driver.memory=4g",
    "--conf", "spark.executor.memory=8g",
    "--conf", "spark.serializer=org.apache.spark.serializer.KryoSerializer",
    route_path,
]
# 4. Set up environment
env = os.environ.copy()
env['DJANGO_SETTINGS_MODULE'] = 'rasLightEngine.settings'
env['SPARK_HOME'] = spark_home
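# 5. Execute the command - passing args as a list (without shell=True) avoids shell parsing issues.
# The child process keeps its own handle to the log file, so closing ours right away is safe.
with open(finalFilePath, 'w') as log_file:
    process = subprocess.Popen(
        command + [raw_body_decoded1],
        stdout=log_file,
        stderr=subprocess.STDOUT,
        env=env,
    )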
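If the error persists, one possible cause (I can't confirm it from the post) is that Spark is launching route.py with your frozen Django executable instead of a real Python interpreter. Spark chooses the program that runs a Python application from the PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON environment variables, so you could point both at an actual interpreter. A minimal sketch follows; the helper name build_spark_env and the interpreter path in the usage note are hypothetical, so adjust them to your machine:

```python
import os


def build_spark_env(spark_home, python_exe, base_env=None):
    """Return a copy of the environment prepared for spark-submit.

    python_exe is assumed to be the path to a real Python interpreter,
    not the frozen Django executable (hypothetical value, set it yourself).
    """
    # Copy so the Django process's own environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_HOME"] = spark_home
    # Spark reads these variables to decide which program runs the
    # driver script and the Python workers.
    env["PYSPARK_PYTHON"] = python_exe
    env["PYSPARK_DRIVER_PYTHON"] = python_exe
    return env
```

You would then pass env=build_spark_env(spark_home, 'C:/path/to/python.exe') to subprocess.Popen in place of the env built in step 4. If the log still shows the ras.exe message after that, the interpreter is probably being picked up from somewhere else, such as a spark.pyspark.python setting.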