How can I use files uploaded via Django with libraries that require a file path on disk?

I am currently working on a text extraction component of my Django backend, which is meant to extract text from URLs (working), plain text (working), and files (.pdf, .doc, .ppt, .md, .txt, .html).

My current code works for hardcoded file paths to valid file inputs:

def extract_from_file(uploaded_file):

    file_type = os.path.splitext(uploaded_file.name)[1].lower()

    if file_type == ".pdf":
        text = pdf_to_text(uploaded_file)
    
    elif file_type in [".doc", ".docx", ".docm", ".dot", ".dotx", ".dotm"]:
        text = doc_to_text(uploaded_file)
    
    elif file_type in [".ppt", ".pptx", ".pps", ".ppsx"]:
        text = ppt_to_text(uploaded_file)
    
    elif file_type in [".md", ".html", ".htm"]:
        text = html_to_text(uploaded_file, file_type)

    elif file_type == ".txt":
        # adapted from https://www.geeksforgeeks.org/pandas/read-html-file-in-python-using-pandas/
        with open(uploaded_file, "r", encoding="utf-8") as f:
            text = f.read()

    else:
        raise ValueError("Unsupported file type: " + file_type)

    if text:
        return article_from_text(text)
    else:
        raise ValueError("No text could be extracted from the file.")

def pdf_to_text(file):

    reader = PdfReader(file.file)
    return "".join([page.extract_text() for page in reader.pages])

def doc_to_text(file):

    document = Document()
    document.LoadFromFile(file)

    text = document.GetText()

    document.Close()

    return text

def ppt_to_text(file):

    presentation = Presentation()
    presentation.LoadFromFile(file)

    sb = []

    # Loop through all slides and extract text to sb list (three nested loops: slides, shapes, paragraphs) - quite slow, maybe better way to do later?
    # based on https://github.com/eiceblue/Spire.Presentation-for-Python/blob/main/Python%20Examples/02_ParagraphAndText/ExtractText.py
    for slide in presentation.Slides:
        for shape in slide.Shapes:
            if isinstance(shape, IAutoShape):
                for tp in shape.TextFrame.Paragraphs:
                    sb.append(tp.Text)
    
    text = "\n".join(sb)
    presentation.Dispose() # Releases all resources used by presentation object

    return text

def html_to_text(file, file_type):

    with open(file, "r", encoding="utf-8") as f:
        file_content = f.read()

    # Convert markdown to html if needed
    if file_type == ".md":
        file_content = markdown(file_content)

    # from https://gist.github.com/lorey/eb15a7f3338f959a78cc3661fbc255fe
    soup = BeautifulSoup(file_content, "html.parser")
    return "\n".join(soup.find_all(string=True))

Currently, uploading only works for PDFs, because that library can read Django's uploaded file object directly. Spire.Doc, Spire.Presentation, and my manual open() calls for the .html, .md and .txt files cannot handle that object. For example, Spire.Doc throws this error:

"plum.function.NotFoundLookupError: For function "LoadFromFile" of spire.doc.interface.IDocument.IDocument, signature Signature(spire.doc.Document.Document, django.core.files.uploadedfile.InMemoryUploadedFile) could not be resolved."

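For the plain-text formats (.txt, .md, .html), a path may not be needed at all: open() expects a path, while the uploaded object is already an open binary file. A minimal sketch of reading it directly, assuming the upload behaves like a binary file object with seek() and read() (as Django's uploaded files do) and that the content is UTF-8; the helper name read_upload_text is hypothetical:

```python
def read_upload_text(uploaded_file, encoding="utf-8"):
    """Return the upload's content as str without going through a file path."""
    # Rewind first, in case something has already read from the object.
    uploaded_file.seek(0)
    # read() returns bytes for an uploaded file, so decode explicitly.
    return uploaded_file.read().decode(encoding)
```

In html_to_text and the .txt branch, this would stand in for the with open(...) block.
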
I do not want to save the file locally just to make this work (which I have already tried). I have also tried creating a temporary file inside the code using

from django.core.files.temp import NamedTemporaryFile

temp_file = NamedTemporaryFile(delete=True)

and then passing temp_file.name into my extraction functions, but that did not work.
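
One possible reason, going only by the snippet shown: nothing writes the upload's bytes into the temporary file, so the library would be handed the path of an empty file with no extension. A hedged sketch of what a complete version might look like, using the standard library's tempfile rather than django.core.files.temp, and assuming the upload exposes .name and .chunks() as Django's UploadedFile does; the helper name as_temp_path is hypothetical:

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def as_temp_path(uploaded_file):
    """Yield a real filesystem path holding a copy of the upload.

    The temporary file is deleted again when the with-block exits.
    """
    # Keep the original extension: some libraries infer the format from it.
    suffix = os.path.splitext(uploaded_file.name)[1].lower()
    # delete=False so the file can be closed before a library reopens it
    # by path (reopening a still-open temp file fails on Windows).
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        # Stream the upload to disk chunk by chunk.
        for chunk in uploaded_file.chunks():
            tmp.write(chunk)
        tmp.close()  # closing flushes the data to disk
        yield tmp.name
    finally:
        tmp.close()  # harmless if already closed
        os.remove(tmp.name)
```

In extract_from_file this would be used roughly as: with as_temp_path(uploaded_file) as path: text = doc_to_text(path). The file exists only for the duration of the extraction, so nothing is kept on disk afterwards.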

I haven't been able to find anything anywhere in relation to this.
