The usual steps for reading a large file in chunks:

```python
import pandas as pd

f = open('./data/ows-raw.txt', encoding='utf-8')
# note: error_bad_lines was deprecated in pandas 1.3; newer versions use on_bad_lines='skip'
reader = pd.read_table(f, sep=',', iterator=True, error_bad_lines=False)
loop = True
chunkSize = 100000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        loop = False
        print("Iteration is stopped.")
df = pd.concat(chunks, ignore_index=True)
```
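As a side note, the explicit `while`/`try` loop above can usually be avoided: passing `chunksize` to `read_csv`/`read_table` returns an iterator of DataFrames that stops cleanly at end of file. A minimal self-contained sketch (using an in-memory CSV in place of the file on disk):

```python
import io

import pandas as pd

# A small in-memory CSV standing in for a large file on disk.
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n")

# With chunksize set, read_csv returns an iterator of DataFrames,
# so no manual get_chunk / StopIteration handling is needed.
chunks = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
print(len(df))  # 5 rows in total
```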
STORY

A few days ago I got a request to read a .dta file and convert it to .csv. A quick Google search showed that pandas supports the .dta format as well, so I started writing right away and had it done in about 20 lines.

Things were not that simple, however…

The read_stata call threw a ValueError straight away.

Another round of Googling turned up no resolved GitHub issues; Stack Overflow had a few suggestions, but they did not seem to be about this particular error.
SOLUTION

With no other option I went and read the source code myself. It turns out that StataReader's get_chunk apparently cannot default to reading everything when no chunksize is given, so I fell back on the approach below: halving the chunksize repeatedly until the file is fully read.
```python
import os
import re

import pandas as pd

target_path = './data/excel/{}.csv'

def dta_to_excel(origin_path):
    CHUNKSIZE = 2000
    reader = pd.read_stata(origin_path, iterator=True)
    file_name = re.sub(r'\.dta', '', origin_path.split('/')[-1])
    print('{} translate start'.format(file_name))
    chunks = []
    while CHUNKSIZE > 0:
        try:
            print('Will get {} lines'.format(CHUNKSIZE))
            chunk = reader.get_chunk(CHUNKSIZE)
            chunks.append(chunk)
        except ValueError:
            # get_chunk chokes near the end of the file;
            # halve the chunk size and retry until it drops to 0.
            print('CHUNKSIZE too large')
            CHUNKSIZE //= 2
        except StopIteration:
            # the reader is exhausted; without this the loop would crash
            break
    df = pd.concat(chunks, ignore_index=True)
    df.to_csv(target_path.format(file_name))
    print('{} translated done'.format(file_name))

if __name__ == '__main__':
    origin_dir = './data/origin'
    for path in os.listdir(origin_dir):
        dta_to_excel(os.path.join(origin_dir, path))
```
It finally produces correct output…
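For what it's worth, in more recent pandas versions the halving loop may not be needed at all: `read_stata` accepts a `chunksize` argument and the returned StataReader is iterable (and usable as a context manager), stopping cleanly at end of file. A minimal sketch, assuming such a pandas version, that round-trips a temporary .dta file so it is self-contained:

```python
import os
import tempfile

import pandas as pd

# Build a small DataFrame and save it as .dta so the example is self-contained.
df_in = pd.DataFrame({'x': list(range(10)), 'y': list('abcdefghij')})
path = os.path.join(tempfile.mkdtemp(), 'sample.dta')
df_in.to_stata(path, write_index=False)

# With chunksize set, the reader yields DataFrames and iteration ends
# at end of file on its own; no manual halving required.
chunks = []
with pd.read_stata(path, chunksize=3) as reader:
    for chunk in reader:
        chunks.append(chunk)

df_out = pd.concat(chunks, ignore_index=True)
print(len(df_out))  # 10
```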