Recently I was tasked with parsing a very large JSON file with Node.js. Typically, parsing JSON in Node is fairly simple. In the past I would do something like the following.
const fs = require('fs');
const rawdata = fs.readFileSync('file.json');
const data = JSON.parse(rawdata);
Or, even simpler, with a require statement like this:
const data = require('./file.json');
Both of these work great with small or even moderately sized files, but what if you need to parse a really large JSON file, one with millions of lines? Reading the entire file into memory is no longer a great option.
Because of this I needed a way to “stream” the JSON and process it as it went. There is a nice module named ‘stream-json’ that does exactly what I wanted.
With stream-json, we can use the Node.js file stream to process our large data file in chunks.
const StreamArray = require('stream-json/streamers/StreamArray');
const fs = require('fs');
const jsonStream = StreamArray.withParser();
//internal Node readable stream option, pipe to stream-json to convert it for us
fs.createReadStream('file.json').pipe(jsonStream.input);
//You'll get JSON objects here
//key is the array index, value is the parsed element
jsonStream.on('data', ({key, value}) => {
    console.log(key, value);
});
jsonStream.on('end', () => {
    console.log('All Done');
});
Now our data can be processed without running out of memory. However, in the use case I was working on, I had an asynchronous process inside the stream. Because of this, I was still consuming huge amounts of memory, since this just queued up a very large number of unresolved promises that had to be kept in memory until they completed.
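To make that concrete, the problematic pattern looked roughly like this (doAsyncWork here is just a stand-in for the real per-record asynchronous work, not code from my actual project):
const StreamArray = require('stream-json/streamers/StreamArray');
const fs = require('fs');
//Stand-in for the real per-record async work
const doAsyncWork = (value) => new Promise((resolve) => setTimeout(resolve, 1000));
const jsonStream = StreamArray.withParser();
fs.createReadStream('file.json').pipe(jsonStream.input);
jsonStream.on('data', ({key, value}) => {
    //Nothing here tells the stream to wait, so every record immediately
    //creates a promise that sits in memory until it resolves
    doAsyncWork(value).then(() => console.log(key, 'done'));
});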
To solve this, I also had to use a custom Writable stream, like this.
const StreamArray = require('stream-json/streamers/StreamArray');
const {Writable} = require('stream');
const fs = require('fs');
const fileStream = fs.createReadStream('file.json');
const jsonStream = StreamArray.withParser();
const processingStream = new Writable({
    write({key, value}, encoding, callback) {
        //some async operations
        setTimeout(() => {
            console.log(key, value);
            //Runs one at a time, need to use a callback for that part to work
            callback();
        }, 1000);
    },
    //Don't skip this, as we need to operate with objects, not buffers
    objectMode: true
});
//Pipe the streams as follows
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);
//So we're waiting for the 'finish' event when everything is done.
processingStream.on('finish', () => console.log('All done'));
The Writable stream only asks for the next chunk once callback() is called, so each asynchronous process completes and its promise resolves before the next one starts, which avoids the memory backup.
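As a side note, on newer versions of Node you can likely get the same one-at-a-time behavior by async-iterating the parser output instead of writing a custom Writable stream. I haven't used this variant myself, but a sketch would look something like this:
const StreamArray = require('stream-json/streamers/StreamArray');
const fs = require('fs');
async function processFile() {
    const jsonStream = StreamArray.withParser();
    fs.createReadStream('file.json').pipe(jsonStream.input);
    //Awaiting inside the loop pauses the stream, so only one record
    //is in flight at a time and promises never pile up
    for await (const {key, value} of jsonStream) {
        await new Promise((resolve) => setTimeout(resolve, 1000)); //stand-in for real async work
        console.log(key, value);
    }
    console.log('All done');
}
processFile();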
This Stack Overflow answer is where I got the examples for this post.
Also, another thing I learned in this process: if you want to start Node with more than the default amount of RAM, you can use the following command.
node --max-old-space-size=4096 file.js
By default the memory limit in Node.js is around 512 MB, so to solve this issue you need to increase it using the --max-old-space-size flag. The command above would give Node 4 GB of RAM to use.
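If you want to verify the flag actually took effect, the built-in v8 module can report the current heap size limit (this is just a quick check I'd suggest, not part of the Stack Overflow examples):
const v8 = require('v8');
//heap_size_limit is reported in bytes, so convert to MB for readability
const limitMb = v8.getHeapStatistics().heap_size_limit / 1024 / 1024;
console.log(`Heap size limit: ${Math.round(limitMb)} MB`);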
10 Comments
Hello,
Why do I get “path is not defined”?
Can you send a screenshot of the error you’re getting?
I too am getting “path is not defined”.
Sorry, I didn't require the path module in my includes. If you just replace
const fileStream = fs.createReadStream(path.join('file.json'));
with
const fileStream = fs.createReadStream('file.json');
and put in the full path to the file, it should work as expected. I will update the post for this.
Thanks,
Curt
There is no difference between the two code snippets, am I missing something?
There is a difference, just a small one: I removed the path.join(). This isn't needed unless you are combining segments of a path.
https://www.w3schools.com/nodejs/met_path_join.asp
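For example, you would only want something like this (a made-up illustration, not code from the post) when building the path out of pieces:
const path = require('path');
const fs = require('fs');
//path.join is only useful when combining path segments like this
const fileStream = fs.createReadStream(path.join(__dirname, 'data', 'file.json'));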
Sorry for the confusion.
Curt
Curt, thanks for this tutorial. What do you think of Algolia's version of just using the built-in stream.pause() and stream.resume() methods, with no custom Writable stream?
Check out their “Using the API” section and JavaScript code:
https://www.algolia.com/doc/guides/sending-and-managing-data/send-and-update-your-data/how-to/sending-records-in-batches/?client=javascript
Would this suffer from memory consumption or is it doing the same thing as you are?
This comment really, really, really helped me a lottt. Thanks a ton jg
I have never used Algolia's stream, but I took a quick look and it does look like it would do the job. I assume they have memory management figured out.
Thank you so much. I was left all alone with this library, having no great help. Your post saved my life.