Reading a PDF file using Azure Synapse notebooks
This is my first post asking for help. Until now I have usually relied on examples from Stack Overflow, but this time I can't find an answer. I apologize if the formatting of my post is not great; I will try to improve it in the future.
I am struggling to read PDF files from Azure Data Lake Storage Gen2 in Azure Synapse notebooks.
Reading a CSV file is not a problem; I can access it with this command:
    %%pyspark
    df = spark.read.load('abfss://**accountname**.dfs.core.windows.net/**file.csv**'
        ## If header exists uncomment line below
        ##, header=True
    )
    display(df.limit(10))

But when I try to read a PDF, it always fails. I have tried libraries such as PyPDF2 and camelot.

    pdf_file = "abfss://**accountname**.dfs.core.windows.net/**file.pdf"
    # Open the PDF using PyPDF2
    pdf_reader = PyPDF2.PdfReader(pdf_file)

I receive an error:
FileNotFoundError: [Errno 2] No such file or directory
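A minimal sketch of what may be happening here. It assumes that PyPDF2 passes a string argument to Python's built-in `open()`, which only understands local filesystem paths and does not resolve `abfss://` URIs the way Spark does. The helper `can_open_locally` and the URI below are illustrative placeholders, not part of the original notebook.

```python
def can_open_locally(path: str) -> bool:
    """Return True if Python's built-in open() can read `path` as a local file."""
    try:
        with open(path, "rb"):
            return True
    except OSError:  # FileNotFoundError is a subclass of OSError
        return False


# Placeholder URI with the same shape as the one used above (hypothetical names)
uri = "abfss://container@accountname.dfs.core.windows.net/file.pdf"
print(can_open_locally(uri))
```

If this assumption holds, the URI is being looked up as a relative local path, which would explain why Spark can read from the same account while PyPDF2 cannot.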
I tried mounting the storage location as described in this post: "How can I read pdf or pptx or docx files in python from ADLS gen2 using Synapse?"
I can still read the CSV file from the mounted storage, but not the PDF.
    mssparkutils.fs.mount(
        "abfss://container@accountname.dfs.core.windows.net/",
        "/TR",
        {"LinkedService": "linkedservice"}
    )

    # can get a path, this command is working:
    path = mssparkutils.fs.getMountPath("TR")
    print(path)

    import PyPDF2
    with open("/synfs/mount#/TR/file.pdf") as f:
        pdf_reader = PyPDF2.PdfReader(f)

This gives an error:
OSError: [Errno 5] Input/output error:
I also tried reading the file using the path returned by getMountPath, but that does not work either.

    file_name = path + "/file.pdf"
    print(file_name)
    reader = PyPDF2.PdfReader(open(file_name, 'rb'))

This gives an error:
OSError: [Errno 5] Input/output error
I also tried passing the path directly to PyPDF2:

    pdf_reader = PyPDF2.PdfReader(file_name)

This gives an error (the traceback shows this part of the PyPDF2 source):

    logger_warning(
        "PdfReader stream/file object is not in binary mode. "
        "It may not be read correctly.",
        __name__,
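For context on the "binary mode" wording in that warning, here is a small sketch using only the standard library. It assumes the check is based on the stream's `mode` attribute: a file opened with plain `open(path)` is a text stream, while `open(path, "rb")` or an `io.BytesIO` buffer is binary. The helpers `is_binary_stream` and `read_into_memory` are hypothetical names for illustration, and the temporary file holds placeholder bytes, not a real PDF.

```python
import io
import os
import tempfile


def is_binary_stream(stream) -> bool:
    """Return True unless the stream advertises a text mode.

    Assumption: mirrors the kind of check the warning describes, i.e. a
    string `mode` attribute without "b". Streams with no `mode` attribute
    (such as io.BytesIO) are treated as binary.
    """
    mode = getattr(stream, "mode", None)
    return not isinstance(mode, str) or "b" in mode


def read_into_memory(local_path: str) -> io.BytesIO:
    """Read a whole file in binary mode and return it as a seekable stream."""
    with open(local_path, "rb") as f:
        return io.BytesIO(f.read())


# Throwaway file with placeholder bytes (not a valid PDF document)
fd, demo_path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as f:
    f.write(b"%PDF-1.4\n% placeholder bytes, not a real document\n")

with open(demo_path) as text_stream:            # text mode, as in open(path)
    text_mode_ok = is_binary_stream(text_stream)      # False
with open(demo_path, "rb") as binary_stream:    # binary mode
    binary_mode_ok = is_binary_stream(binary_stream)  # True

buffer = read_into_memory(demo_path)
os.remove(demo_path)

print(text_mode_ok, binary_mode_ok, is_binary_stream(buffer))
```

With PyPDF2 installed, a buffer like this could be handed to `PyPDF2.PdfReader(read_into_memory(file_name))`, since `PdfReader` accepts a binary stream. This only addresses the binary-mode warning; it would not by itself explain the `[Errno 5] Input/output error` coming from the mounted path.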
Please advise if you know how to solve this. I am using Azure Synapse Studio, not the SDK.
Source: https://stackoverflow.com/questions/780 ... -notebooks