Calculate byte positions to split ndjson files into chunks

Posted by Anonymous » 11 Nov 2025, 12:32

The code below is taken from this repository:

import os.path, io
filename = ""

n_chunks = 12  # Number of processes to use -- will split the file up into this many pieces

def find_newline_pos(f,n):
    f.seek(n)
    c = f.read(1)
    while c != '\n' and n > 0:
        n-=1
        f.seek(n)
        c = f.read(1)
    return(n)

def prestart():
    fsize = os.path.getsize(filename)
    pieces = []   # Holds start and stop position of each chunk
    initial_chunks=list(range(0,fsize,int(fsize/n_chunks)))[:-1]
    f = io.open(filename,'rb')
    pieces = sorted(set([find_newline_pos(f,n) for n in initial_chunks]))
    pieces.append(fsize)
    args = zip([x+1 if x > 0 else x for x in pieces],[x for x in pieces[1:]])
    return(args)

args = prestart()

Purpose of each part of the snippet above:

| Part               | Purpose                                                                               |
| ------------------ | --------------------------------------------------------------------------------------|
| `find_newline_pos` | Moves backward from a byte offset to find the previous newline (`\n`).                |
| `prestart`         | Splits the file into roughly equal chunks that align with newline positions.          |
| `args`             | The list of `(start, end)` byte positions for each chunk — ready for multiprocessing. |
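To make the "ready for multiprocessing" part concrete, here is a minimal sketch of how a worker might consume one `(start, end)` pair. The function name `read_chunk` and its `path` parameter are hypothetical and do not come from the repository; the sketch assumes the pair delimits whole lines and that one chunk fits in memory.

```python
import json

def read_chunk(path, start, end):
    """Parse the ndjson records stored in bytes [start, end) of the file.

    Hypothetical helper for illustration; not part of the original code.
    """
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)  # one read for the whole chunk
    # splitlines() drops the line terminators; blank lines are skipped.
    return [json.loads(line) for line in data.splitlines() if line.strip()]
```

Each worker process would call this with its own pair, for example via `multiprocessing.Pool.starmap`, so no process has to read outside its byte range.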

For a 2 GB file, the code above takes more than 10 minutes to run.
Is there a more efficient way to compute the byte positions for splitting an ndjson file?
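One possible cause, assuming the snippet runs under Python 3: the file is opened in `'rb'` mode, so `f.read(1)` returns `bytes`, and `c != '\n'` compares `bytes` with `str`, which is always true. In that case `find_newline_pos` would walk back one byte at a time to offset 0 for every chunk. The sketch below is a different technique, not the repository's method: it seeks near each approximate offset and lets `readline()` scan forward to the next line boundary. The name `chunk_boundaries` and its parameters are hypothetical.

```python
import os

def chunk_boundaries(path, n_chunks):
    """Return half-open (start, end) byte ranges aligned to line boundaries.

    Hypothetical helper for illustration; scans forward, not backward.
    """
    size = os.path.getsize(path)
    if size == 0:
        return []
    cuts = {0, size}
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            approx = size * i // n_chunks
            if approx == 0:
                continue  # file is smaller than n_chunks bytes
            # Start one byte early: if approx is already the first byte of
            # a line, readline() consumes only the preceding b"\n".
            f.seek(approx - 1)
            f.readline()
            cuts.add(f.tell())  # first byte of the next line, or EOF
    cuts = sorted(cuts)  # the set has already removed duplicate cut points
    return list(zip(cuts[:-1], cuts[1:]))
```

The cost is one seek plus one line read per chunk, so it should not grow with file size. Note that the convention differs from the original: every `start` is the first byte of a line and every `end` is exclusive, so a worker can do `f.seek(start)` followed by `f.read(end - start)`. Very long lines can yield fewer than `n_chunks` ranges.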

More details here: https://stackoverflow.com/questions/79808277/calculate-byte-positions-to-split-ndjson-files-into-chunks
