import os.path, io

filename = ""  # Path to the ndjson file
n_chunks = 12  # Number of processes to use -- will split the file up into this many pieces

def find_newline_pos(f, n):
    # Walk backward from byte offset n until a newline (or the start of the file) is found
    f.seek(n)
    c = f.read(1)
    while c != b'\n' and n > 0:  # the file is opened in binary mode, so compare against bytes
        n -= 1
        f.seek(n)
        c = f.read(1)
    return n

def prestart():
    fsize = os.path.getsize(filename)
    step = max(1, fsize // n_chunks)  # guard against a zero step for very small files
    initial_chunks = list(range(0, fsize, step))[:-1]
    with io.open(filename, 'rb') as f:
        # Newline-aligned boundary of each chunk
        pieces = sorted(set(find_newline_pos(f, n) for n in initial_chunks))
    pieces.append(fsize)
    # Each chunk starts one byte after a newline and stops at the next boundary
    args = list(zip([x + 1 if x > 0 else x for x in pieces], pieces[1:]))
    return args

args = prestart()
| Part | Purpose |
| ------------------ | --------------------------------------------------------------------------------------|
| `find_newline_pos` | Moves backward from a byte offset to find the previous newline (`\n`). |
| `prestart` | Splits the file into roughly equal chunks that align with newline positions. |
| `args` | The list of `(start, end)` byte positions for each chunk — ready for multiprocessing. |
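
As a rough illustration of how the `(start, end)` pairs in `args` might be consumed, the sketch below reads one byte range and parses each line as JSON. The helper name `read_chunk` and the demo file are hypothetical and not part of the original code; the sketch assumes UTF-8 encoded ndjson with `\n` line endings.

```python
import json
import os
import tempfile

def read_chunk(path, start, end):
    """Return the parsed JSON records stored in bytes [start, end) of path."""
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    # splitlines() drops line terminators, so no empty trailing record appears
    return [json.loads(line) for line in data.splitlines() if line.strip()]

# Hypothetical demo file with three records (each line is 9 bytes plus a newline)
with tempfile.NamedTemporaryFile("wb", suffix=".ndjson", delete=False) as tmp:
    tmp.write(b'{"id": 1}\n{"id": 2}\n{"id": 3}\n')
    demo_path = tmp.name

# Newline-aligned byte ranges in the same shape that prestart() produces
demo_args = [(0, 9), (10, 19), (20, 30)]
records = [rec for start, end in demo_args
           for rec in read_chunk(demo_path, start, end)]
os.remove(demo_path)
```

In a multiprocessing setup, each pair could be handed to a worker instead of being read in a loop, for example with `multiprocessing.Pool.starmap`.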
Is there a more efficient way to determine the byte positions at which to split an ndjson file?
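
One possible alternative, offered only as a sketch: instead of stepping backward one byte at a time as `find_newline_pos` does, seek to each approximate offset and let `readline()` skip forward to the end of the current line, so that `tell()` lands on the start of the next record. The function name `forward_offsets` and the demo file are hypothetical. The ranges here are half-open `[start, end)` and include the trailing newline, which differs slightly from the convention in `prestart`. No benchmark is claimed.

```python
import os
import tempfile

def forward_offsets(path, n_chunks):
    """Return newline-aligned (start, end) byte ranges covering the whole file."""
    fsize = os.path.getsize(path)
    step = max(1, fsize // n_chunks)  # avoid a zero step for tiny files
    bounds = {0, fsize}
    with open(path, "rb") as f:
        for target in range(step, fsize, step):
            # Start one byte early so a target that already sits on a line
            # start is kept as-is rather than pushed to the following line
            f.seek(target - 1)
            f.readline()            # consume the rest of the current line
            bounds.add(f.tell())    # position of the next line start (or EOF)
    ordered = sorted(bounds)
    return list(zip(ordered[:-1], ordered[1:]))

# Hypothetical demo file: three 10-byte lines, 30 bytes in total
with tempfile.NamedTemporaryFile("wb", suffix=".ndjson", delete=False) as tmp:
    tmp.write(b'{"id": 1}\n{"id": 2}\n{"id": 3}\n')
    demo_path = tmp.name

ranges_3 = forward_offsets(demo_path, 3)
ranges_4 = forward_offsets(demo_path, 4)  # uneven split still snaps to line starts
os.remove(demo_path)
```

Because duplicate boundaries collapse in the set, the function may return fewer than `n_chunks` ranges when lines are long relative to the step size.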
More details here: https://stackoverflow.com/questions/798 ... nto-chunks