How does PostgreSQL handle get() in Django for exact file path matches and its performance?

I’m using Django with a PostgreSQL database, and I have a model that stores file paths (specifically S3 file paths) in a CharField, like this:

class File(models.Model):
    file_path = models.CharField(max_length=255)

I often retrieve a single file record using Django’s get() method with an exact S3 file path match:

file = File.objects.get(file_path='/bucket/path/to/file.txt')

This translates to the following SQL query:

SELECT * FROM my_table WHERE file_path = '/bucket/path/to/file.txt' LIMIT 1;

I want to understand what happens under the hood when PostgreSQL processes this query. Specifically:

1.  How does PostgreSQL handle the comparison of strings when using the = operator on a CharField with an S3 file path?
2.  Does PostgreSQL behave differently if there’s an index on the file_path column, and how does it decide between using an index or performing a sequential scan?
3.  Is it beneficial to create an index on the file_path field for better performance,

If you know how the ORM works in Django you will find that it just transform the Model syntax notation to a plain text SQL for example:

File.objects.get(file_path='/bucket/path/to/file.txt')

It will get transformed to something like the following:

SELECT * FROM "appname_file" WHERE "file_path" = '/bucket/path/to/file.txt' LIMIT 1;

In this case if file_path is not an index the SQL engine performs the Execution Plan and it will do an operation called: FULL SCAN which means that it'll for sure scan all the rows in the table.

If you decide to create the index file_path it will do something really different because it stores something called Data pages or Leaf nodes which contains a mapping of where a subset of records are allocated as a result it will query the data_pages and from there it will get the unique identifier for that row and won't perform the FULL SCAN.

Considerations:

  • For your use case if you don't have many indexes I think is a good case to create one.
  • You can evaluate the performance generating the plan using the EXPLAIN keyword in sql and you can analyze results.
  • If you want to go deeper into it you can use EXPLAIN ANALYZE it will give you not only the execution plan but the execution time as well.
  • You can use Django to get the execution plan as following:
from django.db import connection
sql = 'EXPLAIN ANALYZE SELECT * FROM appname_file WHERE file_path=%s LIMIT 1;'
params = ['/bucket/path/to/file.txt']
with connection.cursor() as cursor:
    cursor.execute(sql, params):
    plan = cursor.fetchall()

print(plan)

I encourage you to run your own performance evaluations and see if index works well for your use case, I guess it will specially if the appname_file is unique for each row.

Вернуться на верх