How does PostgreSQL handle get() in Django for exact file path matches and its performance?
I’m using Django with a PostgreSQL database, and I have a model that stores file paths (specifically S3 file paths) in a CharField, like this:
class File(models.Model):
file_path = models.CharField(max_length=255)
I often retrieve a single file record using Django’s get() method with an exact S3 file path match:
file = File.objects.get(file_path='/bucket/path/to/file.txt')
This translates to the following SQL query:
SELECT * FROM my_table WHERE file_path = '/bucket/path/to/file.txt' LIMIT 1;
I want to understand what happens under the hood when PostgreSQL processes this query. Specifically:
1. How does PostgreSQL handle the comparison of strings when using the = operator on a CharField with an S3 file path?
2. Does PostgreSQL behave differently if there’s an index on the file_path column, and how does it decide between using an index or performing a sequential scan?
3. Is it beneficial to create an index on the file_path field for better performance,
If you know how the ORM works in Django you will find that it just transform the Model syntax notation to a plain text SQL
for example:
File.objects.get(file_path='/bucket/path/to/file.txt')
It will get transformed to something like the following:
SELECT * FROM "appname_file" WHERE "file_path" = '/bucket/path/to/file.txt' LIMIT 1;
In this case if file_path
is not an index the SQL engine performs the Execution Plan
and it will do an operation called: FULL SCAN
which means that it'll for sure scan all the rows in the table.
If you decide to create the index file_path
it will do something really different because it stores something called Data pages or Leaf nodes
which contains a mapping of where a subset of records are allocated as a result it will query the data_pages
and from there it will get the unique identifier for that row and won't perform the FULL SCAN
.
Considerations:
- For your use case if you don't have many indexes I think is a good case to create one.
- You can evaluate the performance generating the plan using the
EXPLAIN
keyword in sql and you can analyze results. - If you want to go deeper into it you can use
EXPLAIN ANALYZE
it will give you not only the execution plan but the execution time as well. - You can use Django to get the execution plan as following:
from django.db import connection
sql = 'EXPLAIN ANALYZE SELECT * FROM appname_file WHERE file_path=%s LIMIT 1;'
params = ['/bucket/path/to/file.txt']
with connection.cursor() as cursor:
cursor.execute(sql, params):
plan = cursor.fetchall()
print(plan)
I encourage you to run your own performance evaluations and see if index works well for your use case, I guess it will specially if the appname_file
is unique for each row.