Implemented an ETL pipeline using Node.js Streams for a memory-efficient data processing flow, particularly when dealing with large datasets.
The pipeline gathers data from diverse sources, transforms it into a consistent format, and loads it into a MongoDB collection.
Step 1: Data is extracted from four distinct sources: two APIs, a JSON file, and a CSV file. The JSONStream and csv-parser libraries parse the JSON and CSV inputs into streams of records.
Step 2: The extracted streams are piped through a custom Transform stream, which applies the transformation logic to each chunk of data as it flows through.
Step 3: In the loading phase, the transformed data is written to a MongoDB collection using the driver's initializeUnorderedBulkOp method, which batches inserts into unordered bulk writes.
Designed a chunked streaming API for MongoDB datasets, enabling efficient handling of large-scale data and reducing frontend load bottlenecks by approximately 40%.