How can I use files uploaded via Django with libraries that require a file path on disk?

I am working on a text-extraction component for my Django backend. It is meant to extract text from URLs (working), plain text (working), and files (.pdf, .doc, .ppt, .md, .txt, .html).

My current code works when given hardcoded paths to valid files:

def extract_from_file(uploaded_file):

    file_type = os.path.splitext(uploaded_file.name)[1].lower()

    if file_type == ".pdf":
        text = pdf_to_text(uploaded_file)
    
    elif file_type in [".doc", ".docx", ".docm", ".dot", ".dotx", ".dotm"]:
        text = doc_to_text(uploaded_file)
    
    elif file_type in [".ppt", ".pptx", ".pps", ".ppsx"]:
        text = ppt_to_text(uploaded_file)
    
    elif file_type in [".md", ".html", ".htm"]:
        text = html_to_text(uploaded_file, file_type)

    elif file_type == ".txt":
        # adapted from https://www.geeksforgeeks.org/pandas/read-html-file-in-python-using-pandas/
        with open(uploaded_file, "r", encoding="utf-8") as f:
            text = f.read()

    else:
        raise ValueError("Unsupported file type: " + file_type)

    if text:
        return article_from_text(text)
    else:
        raise ValueError("No text could be extracted from the file.")

def pdf_to_text(file):

    reader = PdfReader(file.file)
    return "".join([page.extract_text() for page in reader.pages])

def doc_to_text(file):

    document = Document()
    document.LoadFromFile(file)

    text = document.GetText()

    document.Close()

    return text

def ppt_to_text(file):

    presentation = Presentation()
    presentation.LoadFromFile(file)

    sb = []

    # Loop through all slides and extract text to sb list - O(n^3) - maybe better way to do later? - quite slow
    # based on https://github.com/eiceblue/Spire.Presentation-for-Python/blob/main/Python%20Examples/02_ParagraphAndText/ExtractText.py
    for slide in presentation.Slides:
        for shape in slide.Shapes:
            if isinstance(shape, IAutoShape):
                # shape is already known to be an IAutoShape here, so no second check is needed
                for tp in shape.TextFrame.Paragraphs:
                    sb.append(tp.Text)
    
    text = "\n".join(sb)
    presentation.Dispose() # Releases all resources used by presentation object

    return text

def html_to_text(file, file_type):

    with open(file, "r", encoding="utf-8") as f:
        file_content = f.read()

    # Convert Markdown to HTML if needed (parameter renamed so it no longer shadows the built-in type)
    if file_type == ".md":
        file_content = markdown(file_content)

    # from https://gist.github.com/lorey/eb15a7f3338f959a78cc3661fbc255fe
    soup = BeautifulSoup(file_content, "html.parser")
    return "\n".join(soup.find_all(string=True))

Currently, uploads only work for PDFs, because that library can read Django's file object. Spire.Doc and Spire.Presentation, as well as manually opening the .html, .md and .txt files, do not work with it. For example, Spire.Doc throws this error:

"plum.function.NotFoundLookupError: For function "LoadFromFile" of spire.doc.interface.IDocument.IDocument, signature Signature(spire.doc.Document.Document, django.core.files.uploadedfile.InMemoryUploadedFile) could not be resolved."
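For the plain-text formats (.txt, .md, .html), a file path may not be needed at all, since the uploaded object can be read like any other file-like object. A minimal sketch of that idea, where the helper name read_upload_text is hypothetical and the upload is assumed to support seek() and read() and to be UTF-8 encoded:

```python
import io


def read_upload_text(uploaded_file, encoding="utf-8"):
    """Return the full contents of a file-like upload as a string."""
    uploaded_file.seek(0)  # rewind in case something already read from it
    data = uploaded_file.read()
    # Binary uploads yield bytes; text-mode objects already yield str
    return data.decode(encoding) if isinstance(data, bytes) else data


# Stand-in for an uploaded file (an in-memory binary stream)
fake_upload = io.BytesIO("# Title\n\nSome text".encode("utf-8"))
print(read_upload_text(fake_upload))
```

The returned string could then be handed to markdown() or BeautifulSoup in place of the open() call, which only accepts a path.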

I do not want to save the file permanently to local storage for this to work (I have already tried that). I have also tried creating a temporary file inside the code using

from django.core.files.temp import NamedTemporaryFile

temp_file = NamedTemporaryFile(delete=True)

and then passing temp_file.name into my extraction functions, but that did not work.

I haven't been able to find anything anywhere in relation to this.
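One possible reason the temporary-file attempt fails is that the upload's bytes are never written into the temporary file, or are not flushed, before its name is handed to the library. A minimal sketch of the approach using only the standard library's tempfile module is below. The helper name temporary_path is hypothetical, and it assumes the upload exposes either Django's chunks() method or a plain read():

```python
import io
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_path(uploaded_file, suffix=""):
    """Copy an upload into a temporary file and yield that file's path."""
    # delete=False lets us close the file and reopen it by name,
    # which is required on Windows; we remove it ourselves afterwards.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        if hasattr(uploaded_file, "chunks"):
            # Django uploads: stream in chunks to keep memory use low
            for chunk in uploaded_file.chunks():
                tmp.write(chunk)
        else:
            tmp.write(uploaded_file.read())
        tmp.close()  # flush everything to disk before anyone opens the path
        yield tmp.name
    finally:
        tmp.close()  # harmless if already closed
        os.remove(tmp.name)


# Stand-in for an uploaded file (an in-memory binary stream)
with temporary_path(io.BytesIO(b"hello"), suffix=".txt") as path:
    with open(path, "r", encoding="utf-8") as f:
        print(f.read())
```

Inside the with block, the yielded path could be passed to a path-only function such as doc_to_text(path); the file is deleted again as soon as the block exits, so nothing stays on disk. Keeping the original extension as the suffix may matter for libraries that infer the format from the file name.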
