Python3 pipe I/O on np.ndarray with raw binary data failed

I have a binary raw data file in.dat storing 4 int32 values.

$ xxd in.dat 
00000000: 0100 0000 0200 0000 0300 0000 0400 0000  ................

I want to read them into an np.ndarray, multiply them by 2, and write them to stdout in the same raw binary format as in.dat. The expected output is:

$ xxd out.dat 
00000000: 0200 0000 0400 0000 0600 0000 0800 0000  ................

The code is like this,

#!/usr/bin/env python3

import sys
import numpy as np

if __name__ == '__main__':
    y = np.fromfile(sys.stdin, dtype='int32')
    y *= 2
    sys.stdout.buffer.write(y.astype('int32').tobytes())
    exit(0)

I find it works as expected with <,

$ python3 test.py <in.dat >out.dat

But it does not work with a pipe |. Here comes the error message.

$ cat in.dat | python3 test.py >out.dat
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    y = np.fromfile(sys.stdin, dtype='int32')
OSError: obtaining file position failed

What am I missing here?

2 answers

  • answered 2018-07-11 03:48 Bailey Parker

    This is because when you redirect a file in with <, stdin is seekable: it isn't a TTY or a pipe, it's just a regular file that has been opened as FD 0. A pipe is not seekable, and np.fromfile needs to query the file position, hence the error. Try invoking the following script with cat foo.txt | python3 test.py vs python3 test.py <foo.txt (assuming foo.txt contains some text):

    import sys
    
    sys.stdin.seek(1)
    print(sys.stdin.read())
    

    The former will error with:

    Traceback (most recent call last):
      File "test.py", line 3, in <module>
        sys.stdin.seek(1)
    io.UnsupportedOperation: underlying stream is not seekable
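
The same distinction can be checked without triggering an exception. The snippet below is only a sketch: is_seekable is a hypothetical helper name (not part of numpy or the standard library), and an in-memory buffer plus os.pipe() stand in for the redirected file and the shell pipe.

```python
import io
import os


def is_seekable(stream):
    """Return True if the stream reports that it supports seek()/tell()."""
    try:
        return stream.seekable()
    except (AttributeError, ValueError):
        # No seekable() method, or the stream is already closed.
        return False


# An in-memory file object behaves like a redirected regular file here.
print(is_seekable(io.BytesIO(b"\x01\x00\x00\x00")))

# The read end of a pipe cannot report or change its position.
read_fd, write_fd = os.pipe()
with os.fdopen(read_fd, "rb") as reader, os.fdopen(write_fd, "wb") as writer:
    print(is_seekable(reader))
```

In the script from the question, sys.stdin.buffer.seekable() would be the equivalent check before deciding whether np.fromfile can be used.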
    

    That said, numpy is way overkill for what you're trying to do here. You can easily achieve this with a few lines and struct:

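    import struct
    import sys
    
    FORMAT = '@i'  # one native C int (4 bytes on common platforms)
    SIZE = struct.calcsize(FORMAT)
    
    
    def main():
        while True:
            chunk = sys.stdin.buffer.read(SIZE)
            if len(chunk) < SIZE:
                # read() returns b'' at EOF rather than raising EOFError
                break
            (num,) = struct.unpack(FORMAT, chunk)  # unpack always returns a tuple
            sys.stdout.buffer.write(struct.pack(FORMAT, num * 2))
    
    if __name__ == '__main__':
        main()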
    

    Edit: there's also no need for the exit(0) call. A script that finishes normally exits with status 0 by default.

  • answered 2018-07-11 03:51 juanpa.arrivillaga

    If you use np.frombuffer, it should work both ways:

    pipebytes.py

    import numpy as np
    import sys
    print(np.frombuffer(sys.stdin.buffer.read(), dtype=np.int32))
    

    Now,

    Juans-MacBook-Pro:temp juan$ xxd testdata.dat
    00000000: 0100 0000 0200 0000 0300 0000            ............
    Juans-MacBook-Pro:temp juan$ python pipebytes.py < testdata.dat
    [1 2 3]
    Juans-MacBook-Pro:temp juan$ cat testdata.dat | python pipebytes.py
    [1 2 3]
    Juans-MacBook-Pro:temp juan$
    

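    Note that np.frombuffer itself does not copy the data: the array is a read-only view of the bytes object returned by read(), so an in-place operation such as y *= 2 will fail unless you make a writable copy first.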
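
To finish the task from the question with np.frombuffer, something along these lines should work. It is a sketch under assumptions: double_int32 is a hypothetical helper name, and the dtype is pinned to little-endian '<i4' to match the xxd dump of in.dat (the question uses native 'int32'). The multiplication is done out of place because the array built from a bytes object is read-only.

```python
import numpy as np

# Assumption: little-endian 32-bit ints, matching the xxd dump of in.dat.
DTYPE = np.dtype("<i4")


def double_int32(raw):
    """Interpret raw bytes as int32 values and return the doubled values as bytes."""
    # frombuffer raises ValueError if len(raw) is not a multiple of 4.
    values = np.frombuffer(raw, dtype=DTYPE)
    # values is read-only, so build a new array instead of using *=.
    doubled = (values * 2).astype(DTYPE)
    return doubled.tobytes()


# Same four values as in.dat: 1, 2, 3, 4.
raw = bytes.fromhex("01000000020000000300000004000000")
print(double_int32(raw).hex())
# prints 02000000040000000600000008000000
```

In the actual script, the call would be sys.stdout.buffer.write(double_int32(sys.stdin.buffer.read())), which behaves the same with < and with a pipe because it never asks stdin for its file position.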