Diving into Node.js Streams: A Beginner's Guide

What are Node.js Streams?

  • Processing large datasets can be time-consuming and resource-intensive, and it can slow an application down if done the traditional way: loading the entire dataset into memory at once.

  • Streams in Node.js process data in chunks, so handling a large dataset does not block the rest of the application.

  • The basic principle behind streams in Node.js is to process data in small chunks: read a little, work on it, and move on, rather than waiting for the whole payload.

Streams inherit from the EventEmitter class and emit events at various stages of processing.
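For instance (a minimal sketch, assuming a local file named example.txt), a readable stream emits 'data', 'end', and 'error' events that you can subscribe to:

```javascript
const fs = require('fs');

// 'example.txt' is a placeholder path for this sketch.
const stream = fs.createReadStream('example.txt');

// Because streams are EventEmitters, we subscribe to their events:
stream.on('data', (chunk) => console.log(`Received ${chunk.length} bytes`));
stream.on('end', () => console.log('No more data.'));
stream.on('error', (err) => console.error('Stream failed:', err));
```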

Event-Centric Architecture

  • Node.js streams have an event-driven architecture, making them ideal for real-time I/O and data processing.

    Example:

    If we have a server generating log files and we wish to monitor the logs in real time to catch errors as they occur, we can achieve this easily with streams in Node.js (a sketch follows these steps):

    Step-1: Creating a read stream using fs.createReadStream()

    Step-2: Setting up a readline interface with readline.createInterface() to process the log file line by line.

    Step-3: Listening for the 'line' event emitted for each new line as it is read from the log file.
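A minimal sketch of those three steps, assuming a log file named server.log whose errors contain the text 'ERROR' (note this reads the existing file once; a true live monitor would additionally watch the file for new appends):

```javascript
const fs = require('fs');
const readline = require('readline');

// Step 1: create a read stream over the log file.
const logStream = fs.createReadStream('server.log', { encoding: 'utf8' });

// Step 2: wrap the stream in a readline interface to get line-by-line events.
const rl = readline.createInterface({ input: logStream });

// Step 3: listen for the 'line' event emitted for every line.
rl.on('line', (line) => {
  if (line.includes('ERROR')) {
    console.log(`Error found: ${line}`);
  }
});

rl.on('close', () => console.log('Done reading the log.'));
```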

Advantages of Streams

  • Memory efficient: Handling data in smaller segments instead of loading it entirely into memory at once enhances memory efficiency.

  • Improved Response Time: Each chunk is processed as soon as it arrives and the response is sent back to the user incrementally, thereby improving the response time.

  • Improved Scalability: Because streams can handle large datasets without loading them fully into memory, applications that use them scale better.
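To make the memory point concrete, here is a sketch of a simple file server ('big-file.txt' and port 3000 are placeholders): instead of buffering the whole file with fs.readFile(), it pipes the file to the response chunk by chunk.

```javascript
const fs = require('fs');
const http = require('http');

http.createServer((req, res) => {
  // Stream the file instead of buffering it all with fs.readFile():
  // memory use stays at roughly one chunk at a time, and the first
  // bytes reach the client as soon as they are read from disk.
  const fileStream = fs.createReadStream('big-file.txt');
  fileStream.pipe(res);
  fileStream.on('error', () => {
    res.statusCode = 500;
    res.end('Could not read file');
  });
}).listen(3000);
```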

Handling the flow of data: Backpressure

Backpressure is the mechanism in streams that controls the flow of data from the data source to the data consumer.

An easy analogy to understand backpressure:

  • Imagine you have a garden hose connected to a water faucet; the water flowing through the faucet represents the data in Node.js:

  • Water Faucet (Data Source): Just as the water faucet controls the flow of water, the data source controls the flow of data.

  • Garden Hose (Stream): The hose carries the water from the faucet to the garden; in a similar way, the stream carries chunks of data from the data source to the destination.

  • Garden (Data Destination): The water reaches the garden, just as the data reaches the destination.

Now let's say the flow of water through the hose is left uncontrolled, resulting in an overflow; this is analogous to what happens in Node.js when a stream's consumer cannot keep up, and it is exactly the situation backpressure deals with.

  • Normal Flow: When the faucet (data source) and the garden (data destination) are in sync, the water (data) flows smoothly.

  • Excess Flow: If the faucet releases more water than the garden can absorb, the hose will start to overflow. Similarly, if the data source sends more data than the destination can process, it creates backpressure, causing the stream to slow down or buffer the excess data to prevent overload.

Importance of Backpressure:

  • Data Loss: If data arrives faster than the consumer can handle it, some chunks may be discarded, leading to data loss.

  • Performance Issues: If the data flow exceeds the consumer's capacity, it may lead to slow data processing, higher memory consumption, and frequent crashes, thereby degrading performance.

Managing Backpressure:

Returning to the garden hose analogy: one may control the flow of water by adjusting the faucet. Similarly, streams handle backpressure by throttling the flow of data until the destination catches up.

The main goal of backpressure handling is to control the flow of data from the source to the destination. This is achieved through a controlled buffering system in which the stream temporarily holds data until the consumer is ready to process more. In streams, this buffering is governed by an option called highWaterMark, which specifies the number of bytes (or objects, in object mode) that can be buffered internally.
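For illustration, highWaterMark is passed as an option when a stream is created. The sizes below are arbitrary choices for this sketch; in recent Node.js versions the defaults are 64 KiB for fs read streams and 16 KiB for writable streams.

```javascript
const fs = require('fs');

// Buffer at most 32 KiB before pausing reads (arbitrary value for this sketch).
const readable = fs.createReadStream('input.txt', { highWaterMark: 32 * 1024 });

// Signal backpressure once 16 KiB of pending writes are queued.
const writable = fs.createWriteStream('output.txt', { highWaterMark: 16 * 1024 });
```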

Buffering and Flow Control:

The following steps describe how buffering works internally (a sketch tying them together follows the list):

  • Data Production: When a producer (like a readable stream) generates data, it fills the internal buffer up to the limit set by highWaterMark.

  • Buffering: If the internal buffer reaches the highWaterMark, the stream temporarily stops reading from the source, or accepting more data, until some of the buffered data is consumed.

  • Data Consumption: The consumer (like a writable stream) processes data from the buffer. As the buffer drains below the highWaterMark, the stream resumes reading or accepting data, replenishing the buffer.
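A minimal sketch of the classic manual pattern ('input.txt' and 'output.txt' are placeholders): write() returns false once the writable's internal buffer reaches its highWaterMark, we pause the producer, and the 'drain' event tells us the buffer has emptied enough to resume.

```javascript
const fs = require('fs');

const readable = fs.createReadStream('input.txt');
const writable = fs.createWriteStream('output.txt');

readable.on('data', (chunk) => {
  // Data production and consumption: push each chunk to the writable.
  const canContinue = writable.write(chunk);
  if (!canContinue) {
    // Buffering: the writable's buffer hit its highWaterMark,
    // so stop producing until it drains.
    readable.pause();
    writable.once('drain', () => readable.resume());
  }
});

readable.on('end', () => writable.end());
```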

This backpressure mechanism ensures that the data flowing through the pipeline is regulated, preventing the consumer from being overwhelmed and ensuring smooth, efficient and reliable data processing.
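In practice you rarely wire this up by hand: pipe(), or better, stream.pipeline() applies the same pause/drain logic for you and also forwards errors (a sketch with placeholder file names):

```javascript
const fs = require('fs');
const { pipeline } = require('stream');

// pipeline() manages backpressure between the stages automatically
// and reports an error from any stage to the final callback.
pipeline(
  fs.createReadStream('input.txt'),
  fs.createWriteStream('output.txt'),
  (err) => {
    if (err) console.error('Pipeline failed:', err);
    else console.log('Pipeline succeeded.');
  }
);
```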

In the next blog, I will share a practical implementation of streams in Node.js. Thanks for reading :)