Recently I was tasked with parsing a very large JSON file with Node.js. Typically, parsing JSON in Node is fairly simple. In the past I would do something like the following.
const fs = require('fs');
const rawdata = fs.readFileSync('file.json');
const data = JSON.parse(rawdata);
Or, even simpler, with a require statement like this:
const data = require('./file.json');
Both of these work great with small or even moderately sized files, but what if you need to parse a really large JSON file, one with millions of lines? Reading the entire file into memory is no longer a great option.
Because of this I needed a way to “stream” the JSON and process it as it went. There is a nice module named ‘stream-json’ that does exactly what I wanted.
With stream-json, we can use the Node.js file stream to process our large data file in chunks.
const StreamArray = require('stream-json/streamers/StreamArray');
const fs = require('fs');
const jsonStream = StreamArray.withParser();
//internal Node readable stream option, pipe to stream-json to convert it for us
fs.createReadStream('file.json').pipe(jsonStream.input);
//You'll get JSON objects here
//key is the array index, value is the parsed element
jsonStream.on('data', ({key, value}) => {
    console.log(key, value);
});
jsonStream.on('end', () => {
    console.log('All Done');
});
Now our data can be processed without running out of memory. However, in the use case I was working on, I had an asynchronous process inside the stream. Because of this, I was still consuming huge amounts of memory, since this just queued up a very large number of unresolved promises that had to be kept in memory until they completed.
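To make that concrete, the problematic pattern looked roughly like this (doAsyncWork here is just a stand-in for the real per-record asynchronous work, not code from my actual project):
const StreamArray = require('stream-json/streamers/StreamArray');
const fs = require('fs');
//Stand-in for the real per-record async work
const doAsyncWork = (value) => new Promise((resolve) => setTimeout(resolve, 1000));
const jsonStream = StreamArray.withParser();
fs.createReadStream('file.json').pipe(jsonStream.input);
jsonStream.on('data', ({key, value}) => {
    //Nothing here tells the stream to wait, so every record immediately
    //creates a promise that sits in memory until it resolves
    doAsyncWork(value).then(() => console.log(key, 'done'));
});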
To solve this, I also had to use a custom Writable stream, like this.
const StreamArray = require('stream-json/streamers/StreamArray');
const {Writable} = require('stream');
const fs = require('fs');
const fileStream = fs.createReadStream('file.json');
const jsonStream = StreamArray.withParser();
const processingStream = new Writable({
    write({key, value}, encoding, callback) {
        //some async operations
        setTimeout(() => {
            console.log(key, value);
            //Runs one at a time, need to use a callback for that part to work
            callback();
        }, 1000);
    },
    //Don't skip this, as we need to operate with objects, not buffers
    objectMode: true
});
//Pipe the streams as follows
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);
//So we're waiting for the 'finish' event when everything is done.
processingStream.on('finish', () => console.log('All done'));
The Writable stream only asks for the next chunk once callback() is called, so each asynchronous process completes and its promise resolves before the next one starts, which avoids the memory backup.
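As a side note, on newer versions of Node you can likely get the same one-at-a-time behavior by async-iterating the parser output instead of writing a custom Writable stream. I haven't used this variant myself, but a sketch would look something like this:
const StreamArray = require('stream-json/streamers/StreamArray');
const fs = require('fs');
async function processFile() {
    const jsonStream = StreamArray.withParser();
    fs.createReadStream('file.json').pipe(jsonStream.input);
    //Awaiting inside the loop pauses the stream, so only one record
    //is in flight at a time and promises never pile up
    for await (const {key, value} of jsonStream) {
        await new Promise((resolve) => setTimeout(resolve, 1000)); //stand-in for real async work
        console.log(key, value);
    }
    console.log('All done');
}
processFile();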
This Stack Overflow answer is where I got the examples for this post.
Also, another thing I learned in this process: if you want to start Node with more than the default amount of RAM, you can use the following command.
node --max-old-space-size=4096 file.js
By default the memory limit in Node.js is around 512 MB, so to solve this issue you need to increase it using the --max-old-space-size flag. The command above would give Node 4 GB of RAM to use.
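If you want to verify the flag actually took effect, the built-in v8 module can report the current heap size limit (this is just a quick check I'd suggest, not part of the Stack Overflow examples):
const v8 = require('v8');
//heap_size_limit is reported in bytes, so convert to MB for readability
const limitMb = v8.getHeapStatistics().heap_size_limit / 1024 / 1024;
console.log(`Heap size limit: ${Math.round(limitMb)} MB`);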
10 Comments
Hello,
Why do I get “path is not defined”?
Can you send a screenshot of the error you’re getting?
I too am getting “path is not defined”.
Sorry, I didn't require the path module in my includes. If you just replace
const fileStream = fs.createReadStream(path.join('file.json'));
with
const fileStream = fs.createReadStream('file.json');
and put in the full path to the file, it should work as expected. I will update the post for this.
Thanks,
Curt
There is no difference between the two code snippets, am I missing something?
There is a difference, just a small one: I removed the path.join(). This isn't needed unless you are combining segments of a path.
https://www.w3schools.com/nodejs/met_path_join.asp
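For example, you would only want something like this (a made-up illustration, not code from the post) when building the path out of pieces:
const path = require('path');
const fs = require('fs');
//path.join is only useful when combining path segments like this
const fileStream = fs.createReadStream(path.join(__dirname, 'data', 'file.json'));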
Sorry for the confusion.
Curt
Curt, thanks for this tutorial. What do you think of Algolia's version of just using the built-in stream.pause() and stream.resume() methods, with no custom Writable stream?
Check out their “Using the API” section and JavaScript code:
https://www.algolia.com/doc/guides/sending-and-managing-data/send-and-update-your-data/how-to/sending-records-in-batches/?client=javascript
Would this suffer from memory consumption or is it doing the same thing as you are?
This comment really, really, really helped me a lottt. Thanks a ton jg
I have never used Algolia's stream, but I took a quick look and it does look like it would do the job. I assume they have memory management figured out.
Thank you so much. I was left all alone with this library, having no great help. Your post saved my life.