Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.

DataOps applies Agile, DevOps, and statistical process control to data, aiming to improve the quality of data products and speed up their release, reducing time to value. It is a cultural shift emphasizing collaboration, continuous learning, and rapid iteration, supported by three core technical elements: automation (for reliability, consistency, and CI/CD), monitoring and observability (including the Data Observability Driven Development (DODD) method, which provides end-to-end visibility into data to prevent issues), and incident response (to rapidly identify and resolve problems). DataOps seeks to proactively address issues, build trust, and continuously improve data engineering workflows, even though data operations remains less mature than software engineering.
ETL (extract, transform, load) is the traditional data warehouse approach: the extract phase pulls data from source systems; the transform phase cleans and standardizes the data while organizing it and imposing business logic in a highly modeled form; and the load phase pushes the transformed data into the data warehouse target database. These processes are typically handled by systems external to the warehouse and work hand in hand with specific business structures and teams.
ELT (extract, load, transform) is a variation in which data moves more directly from production systems into a staging area in the data warehouse in raw form, and transformations are handled within the data warehouse itself rather than by external systems. This approach takes advantage of the massive computational power of cloud data warehouses and data processing tools: data is processed in batches, and the transformed output is written into tables and views for analytics.
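As a rough illustration of the ELT pattern, the sketch below loads raw records into a staging table and then transforms them with SQL inside the warehouse; SQLite stands in for a cloud warehouse, and all table and column names are hypothetical.

```python
# Minimal ELT sketch: load raw rows into a staging table, then transform
# inside the "warehouse" with SQL. SQLite stands in for a cloud warehouse;
# the table and column names are illustrative only.
import sqlite3

raw_orders = [  # pretend these rows were extracted from a production system
    ("o-1", "2024-01-05", "  Widget ", 2, 9.99),
    ("o-2", "2024-01-06", "Gadget",    1, 24.50),
]

con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS stg_orders "
            "(order_id TEXT, order_date TEXT, product TEXT, qty INT, price REAL)")
con.executemany("INSERT INTO stg_orders VALUES (?, ?, ?, ?, ?)", raw_orders)  # L: load raw data

# T: transform inside the warehouse into a modeled table for analytics
con.execute("""
    CREATE TABLE IF NOT EXISTS fct_daily_revenue AS
    SELECT order_date, TRIM(product) AS product, SUM(qty * price) AS revenue
    FROM stg_orders
    GROUP BY order_date, TRIM(product)
""")
con.commit()
```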
A data warehouse is a central data hub designed for reporting and analysis, characterized as a subject-oriented, integrated, nonvolatile, and time-variant collection of data that supports management decisions.
- Data warehouses separate online analytical processing (OLAP) from production databases and centralize data through ETL (extract, transform, load) or ELT (extract, load, transform) processes, organizing it into highly formatted structures optimized for analytics.
- Data warehouses traditionally required significant enterprise budgets but have become more accessible through cloud models.
- A data mart is a refined subset of a data warehouse specifically designed to serve the analytics and reporting needs of a single suborganization, department, or line of business.
- Data marts exist to make data more accessible to analysts and provide an additional transformation stage beyond initial ETL/ELT pipelines, improving performance for complex queries by pre-joining and aggregating data.
- Cloud data warehouses represent a significant evolution from on-premises architectures. Pioneered by Amazon Redshift and popularized by Google BigQuery and Snowflake, they offer pay-as-you-go scalability, separate compute from storage by relying on object storage for virtually limitless capacity, and can process petabytes of data in a single query.
- Cloud data warehouses have expanded MPP capabilities to cover big data use cases that previously required Hadoop clusters, blurring the line between traditional data warehouses and data lakes while evolving into broader data platforms with enhanced capabilities.
- A data lake is a central repository that stores all data (structured, semi-structured, and unstructured) in its raw format with virtually limitless capacity. It emerged during the big data era as an alternative to structured data warehouses, promising democratized data access and flexible processing with technologies like Spark. The first generation (data lake 1.0) instead became known as a "data swamp": it lacked schema management, data cataloging, and discovery tools, was essentially write-only, and created compliance problems with regulations such as GDPR and CCPA.
- A data lakehouse represents a convergence between data lakes and data warehouses: it incorporates the controls, data management, and data structures found in data warehouses while still housing data in object storage and supporting various query and transformation engines. ACID transaction support addresses the limitations of first-generation data lakes, providing proper data management capabilities instead of the original write-only approach (see the sketch after this list).
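To make the ACID-on-object-storage idea concrete, here is a minimal sketch using the open Delta Lake table format with PySpark; it assumes the delta-spark and pyspark packages are installed, and the local path, schema, and values are hypothetical (a production lakehouse would point at object storage paths such as s3a://...).

```python
# Minimal lakehouse sketch: transactional (ACID) writes to an open table format
# that normally lives in object storage. Assumes delta-spark + pyspark; the
# path and schema below are illustrative only.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Append rows as an atomic transaction; concurrent readers see a consistent snapshot.
df = spark.createDataFrame(
    [("o-1", "2024-01-05", 9.99), ("o-2", "2024-01-06", 24.50)],
    ["order_id", "order_date", "amount"],
)
df.write.format("delta").mode("append").save("/tmp/lakehouse/orders")  # would be s3a://... in practice

spark.read.format("delta").load("/tmp/lakehouse/orders").show()
```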
Lambda architecture is a data processing architecture that handles both batch and streaming data through three independent systems: a batch layer that processes historical data in systems like data warehouses, a speed layer that processes real-time data with low latency using NoSQL databases, and a serving layer that combines results from both layers, though it faces challenges with managing multiple codebases and reconciling data between systems.
Kappa architecture is an alternative to Lambda that uses a stream-processing platform as the backbone for all data handling—ingestion, storage, and serving—enabling both real-time and batch processing on the same data by reading live event streams and replaying large chunks for batch processing, though it hasn’t seen widespread adoption due to streaming complexity and cost compared to traditional batch processing.
The Dataflow model, developed by Google and implemented through Apache Beam, addresses the challenge of unifying batch and streaming data by viewing all data as events where aggregation is performed over various windows, treating real-time streams as unbounded data and batches as bounded event streams, enabling both processing types to happen in the same system using nearly identical code through the philosophy of "batch as a special case of streaming," which has been adopted by frameworks like Flink and Spark.
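The sketch below illustrates the "batch as a special case of streaming" idea with the Apache Beam Python SDK: the same event-time windowed aggregation runs over a bounded text file, and swapping in an unbounded source (e.g., Pub/Sub or Kafka) would make it a streaming job. The file name and record format are hypothetical.

```python
# One windowed aggregation for both bounded (batch) and unbounded (streaming) data.
# Assumes the apache-beam package; the input file and its format are illustrative.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

def parse(line):
    # Hypothetical format: "user_id,event_time_unix_seconds,amount"
    user_id, ts, amount = line.split(",")
    # Attach the event time so windows are computed in event time, not arrival time.
    return TimestampedValue((user_id, float(amount)), float(ts))

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("events.csv")   # bounded source; an unbounded
                                                          # source would make this streaming
        | "Parse" >> beam.Map(parse)
        | "Window" >> beam.WindowInto(FixedWindows(60))   # 60-second event-time windows
        | "SumPerUser" >> beam.CombinePerKey(sum)         # aggregate within each window
        | "Write" >> beam.io.WriteToText("per_user_totals")
    )
```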
IoT devices are physical hardware devices that sense their environment and collect and transmit data, connected through IoT gateways that serve as hubs for data retention and routing to the internet. Ingestion flows into event architectures that vary from real-time streaming to batch uploads depending on connectivity; storage requirements range from batch object storage for remote sensors to message queues for real-time responses; and serving patterns span batch reports to real-time anomaly detection, including reverse ETL patterns in which analyzed sensor data is sent back to reconfigure and optimize manufacturing equipment.
An online transactional processing (OLTP) data system is designed as an application database that stores the state of an application, typically supporting atomicity, consistency, isolation, and durability (the ACID properties), but it is not ideal for large-scale analytics or bulk queries.
In contrast, an online analytical processing (OLAP) system is designed for large-scale, interactive analytics queries, which makes it inefficient for single-record lookups but well suited as a data source for machine learning models or reverse ETL workflows; the "online" part implies that the system constantly listens for incoming queries.
A message is a discrete piece of raw data communicated between systems, which is typically removed from a queue once it’s delivered and consumed, while a stream is an append-only, ordered log of event records that are persisted over a longer duration to allow for complex analysis of what happened over many events.
While time is an essential consideration for all data ingestion, it becomes even more critical and subtle in the context of streaming: event time is when an event is generated at the source; ingestion time is when it enters a storage system like a message queue, cache, memory, object storage, or a database; process time is when it undergoes transformation; and processing time measures how long that transformation took, in seconds, minutes, hours, etc.
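A small, hypothetical example of these timestamps on a single record (the field names are illustrative, not from the source):

```python
# Four timestamps attached to one streaming record, and the lags between them.
from datetime import datetime, timezone

event = {
    "event_time":     datetime(2024, 1, 5, 12, 0, 0, tzinfo=timezone.utc),   # generated at the source
    "ingestion_time": datetime(2024, 1, 5, 12, 0, 4, tzinfo=timezone.utc),   # landed in the message queue
    "process_start":  datetime(2024, 1, 5, 12, 0, 9, tzinfo=timezone.utc),   # transformation began
    "process_end":    datetime(2024, 1, 5, 12, 0, 11, tzinfo=timezone.utc),  # transformation finished
}

ingestion_lag   = event["ingestion_time"] - event["event_time"]     # how far behind the source we ingest
processing_time = event["process_end"] - event["process_start"]     # how long the transformation took
print(ingestion_lag.total_seconds(), processing_time.total_seconds())  # 4.0 2.0
```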
A Relational Database Management System (RDBMS) stores data in tables of relations (rows) with fields (columns), typically indexed by a primary key and supporting foreign keys for joins, and is popular, ACID compliant, and ideal for storing rapidly changing application states, often employing normalization to prevent data duplication.
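As a minimal sketch of these ideas, the example below uses the standard-library SQLite module to create two normalized tables linked by a primary/foreign key and reassemble them with a join; the schema is hypothetical.

```python
# Normalized relational tables with a primary/foreign key and a join,
# using SQLite from the standard library; the schema is illustrative only.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),  -- foreign key
        amount      REAL
    );
""")
con.execute("INSERT INTO customers VALUES (1, 'Ada')")
con.execute("INSERT INTO orders VALUES (100, 1, 42.0)")

# Normalization keeps the customer's name in one place; a join reassembles it.
row = con.execute("""
    SELECT c.name, o.amount
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
""").fetchone()
print(row)  # ('Ada', 42.0)
```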
NoSQL, or "not only SQL," databases emerged as alternatives to relational systems, offering improved performance, scalability, and schema flexibility by abandoning traditional RDBMS constraints like strong consistency, joins, or fixed schemas; they encompass diverse types such as key-value, document, wide-column, graph, search, and time series, each tailored for specific workloads.
- A key-value database is a nonrelational database that uniquely identifies and retrieves records using a key, similar to a hash map but more scalable, offering diverse performance characteristics for use cases ranging from high-speed, temporary caching to durable persistence for massive event state changes (see the sketch after this list).
- A document store, a specialized key-value database, organizes nested objects (documents, often JSON-like) into collections for key-based retrieval; while offering schema flexibility, it is typically not ACID compliant and lacks joins, promoting denormalization, and is often eventually consistent.
- Wide-column databases are optimized for massive data storage, high transaction rates, and low latency, scaling to petabytes and millions of requests per second, which has made them popular in various industries; however, they support rapid scans only through a single row key index, so complex queries require extracting data to a secondary analytics system. Note that wide-column refers to the database's architecture, which allows flexible and sparse columns per row, and is distinct from a wide table, a denormalized data modeling concept with many columns.
- Graph databases explicitly store data with a mathematical graph structure (nodes and edges), making them ideal for analyzing connectivity and complex traversals between elements, unlike other databases that struggle with such queries; they use specialized query languages and present unique challenges for data engineers in terms of data mapping and analytics tool adoption.
- A search database is a nonrelational database optimized for fast search and retrieval of data's simple and complex semantic and structural characteristics, primarily used for text search (exact, fuzzy, or semantic matching) and log analysis (anomaly detection, real-time monitoring, security, and operational analytics), often leveraging indexes for speed.
- A time-series database is optimized for retrieving and statistically processing time-ordered data, handling high-velocity, often write-heavy workloads (including regularly generated measurement data and irregularly created event-based data) by using memory buffering for fast writes and reads; its timestamp-ordered schema makes it suitable for operational analytics, though generally not for BI due to a typical lack of joins.
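The sketch referenced in the key-value bullet above, assuming a local Redis server and the redis-py client; the key names and values are hypothetical:

```python
# Key-value caching sketch: one key, one opaque value, no joins or secondary indexes.
# Assumes a Redis server on localhost and the redis-py package.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Cache a session object under a key with a one-hour expiry.
r.set("session:abc123", json.dumps({"user_id": 42, "cart": ["sku-1", "sku-2"]}), ex=3600)

# Retrieval is a single key lookup.
session = json.loads(r.get("session:abc123"))
print(session["cart"])  # ['sku-1', 'sku-2']
```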
APIs are standard for data exchange in the cloud and between systems, with HTTP-based APIs being most popular.
- REST (Representational State Transfer) is a dominant, stateless API paradigm built around HTTP verbs, though it lacks a full specification and requires domain knowledge (see the sketch after this list).
- GraphQL is an alternative query language that allows retrieving multiple data models in a single request, offering more flexibility than REST.
- Webhooks are event-based, reverse APIs where the data source triggers an HTTP endpoint on the consumer side when specified events occur.
- RPC (Remote Procedure Call) allows running procedures on remote systems; gRPC, developed by Google, is an efficient RPC library built on Protocol Buffers for bidirectional data exchange over HTTP/2.
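A hedged REST example using Python's requests library; the endpoint, parameters, and response shape are hypothetical:

```python
# Stateless REST call over HTTP: the resource is addressed by URL, filtering is
# done with query parameters, and each request carries all the context it needs.
import requests

resp = requests.get(
    "https://api.example.com/v1/orders",       # hypothetical resource URL
    params={"status": "shipped", "page": 1},   # filtering/pagination via query parameters
    headers={"Authorization": "Bearer <token>"},
    timeout=10,
)
resp.raise_for_status()   # HTTP status codes signal success or failure
orders = resp.json()      # response body parsed from JSON
```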
Event-driven architectures, critical for data apps and real-time analytics, leverage message queues and event-streaming platforms, which can serve as source systems and span the data engineering lifecycle.
- Message queues asynchronously send small, individual messages between decoupled systems using a publish/subscribe model, buffering messages, ensuring durability, and handling delivery frequency (exactly once or at least once), though message ordering can be fuzzy in distributed systems.
- Event-streaming platforms are a continuation of message queues but primarily ingest and process data as an ordered, replayable log of records (events with a key, value, and timestamp), organized into topics (collections of related events) and subdivided into stream partitions for parallelism and fault tolerance, with careful partition key selection needed to avoid hotspotting (see the sketch after this list).
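To illustrate the partition-key point referenced above, here is a plain-Python sketch of hash-based partition assignment; the key scheme and partition count are hypothetical, and real platforms such as Kafka use their own partitioners.

```python
# Assigning events to stream partitions by hashing the partition key.
# A well-chosen key (e.g., user_id) spreads load; keying every event by a
# single constant value would send all traffic to one hot partition.
import hashlib
from collections import Counter

NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

events = [{"user_id": f"user-{i % 1000}", "payload": "..."} for i in range(10_000)]
load = Counter(partition_for(e["user_id"]) for e in events)
print(load)  # roughly even counts across the 8 partitions
```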
Distributed systems navigate a fundamental trade-off between performance and data integrity, where systems built on the ACID model guarantee strong consistency for data correctness at the cost of higher latency, while those following the BASE model favor high availability and scalability through eventual consistency, which tolerates temporary data staleness.
File storage systems—whether local (like NTFS and ext4), network-attached (NAS), storage area network (SAN), or cloud-based (like Amazon EFS)—organize data into directory trees with specific read/write characteristics; however, for modern data pipelines, object storage, such as Amazon S3, is often the preferred approach.
Block storage, the raw storage provided by disks and virtualized in the cloud, offers fine-grained control and is fundamental for transactional databases and VM boot disks, with solutions like RAID, SAN, and cloud-specific offerings (e.g., Amazon EBS) providing varying levels of performance, durability, and features, alongside local instance volumes for high-performance, ephemeral caching.
- A block is the smallest addressable unit of data supported by a disk, typically 4,096 bytes on modern disks, containing extra bits for error detection/correction and metadata.
- On magnetic disks, blocks are geometrically arranged, and reading blocks on different tracks requires a "seek," a time-consuming operation that is negligible on SSDs.
- RAID (Redundant Array of Independent Disks) controls multiple disks to improve data durability, enhance performance, and combine capacity, appearing as a single block device to the operating system.
- SAN (Storage Area Network) systems provide virtualized block storage devices over a network, offering fine-grained scaling, enhanced performance, availability, and durability.
- Cloud virtualized block storage solutions, like Amazon EBS, abstract away SAN clusters and networking, providing various tiers of service with different performance characteristics (IOPS and throughput), backed by SSDs for higher performance or magnetic disks for lower cost per gigabyte.
- Local instance volumes are block storage physically attached to the host server running a VM, offering low cost, low latency, and high IOPS, but their data is lost when the VM shuts down or is deleted, and they lack advanced virtualization features like replication or snapshots.
Object storage, a key-value store for immutable data objects (like files, images, and videos) popular in big data and cloud environments (e.g., Amazon S3), offers high performance for parallel reads/writes, scalability, durability, and various storage tiers, but lacks in-place modification and true directory hierarchies, making it ideal for data lakes and ML pipelines despite consistency and versioning complexities.
- Object storage is a key-value store for immutable data objects (files like TXT, CSV, JSON, images, videos, audio) that has gained popularity with big data and cloud computing (e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage).
- Objects are written once as a stream of bytes and become immutable; changes require rewriting the full object.
- Object stores support highly performant parallel stream writes and reads, scaling with the number of streams and virtual machines, making them ideal for high-volume web traffic and distributed query engines.
- Cloud object storage offers high durability and availability by saving data in multiple availability zones, and provides various storage classes at discounted prices in exchange for reduced access frequency or durability.
- Object stores are a key ingredient in separating compute and storage, enabling ephemeral clusters and virtually limitless, scalable storage.
- For data engineering, object stores excel at large batch reads and writes, serving as the gold standard for data lakes, and are suitable for unstructured data in ML pipelines.
- Object lookup relies on a top-level logical container (bucket) and a key, lacking a true directory hierarchy; consequently, "directory"-level operations can be costly due to the need for key prefix filtering (see the sketch after this list).
- Object consistency can be eventual or strong (the latter often achieved with an external database), and object versioning allows retaining previous immutable versions, though at an increased storage cost.
- Lifecycle policies allow automatic deletion of old versions or archiving to cheaper tiers.
- Object store-backed filesystems (e.g., s3fs, Amazon S3 File Gateway) allow mounting object storage as local storage, but are best suited for infrequently updated files due to the immutability of objects.
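The sketch referenced in the key-prefix bullet above, using the boto3 client for Amazon S3; the bucket and key names are hypothetical:

```python
# Object storage sketch: objects are written whole under a bucket/key, and
# "directories" are just key prefixes. Assumes boto3 and configured AWS credentials;
# the bucket and keys are illustrative only.
import boto3

s3 = boto3.client("s3")

# Write an immutable object; changing it later means rewriting the whole object.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/orders/2024-01-05.json",
    Body=b'{"order_id": "o-1", "amount": 9.99}',
)

# "Directory"-style listing is really a key-prefix filter.
listing = s3.list_objects_v2(Bucket="my-data-lake", Prefix="raw/orders/")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])
```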
The Hadoop Distributed File System (HDFS), based on Google File System (GFS), is a distributed storage system that breaks large files into blocks managed by a NameNode, with data replicated across multiple nodes for durability and availability, and it combines compute and storage on the same nodes, unlike object stores.
- HDFS breaks large files into blocks (less than a few hundred megabytes each), managed by a NameNode that maintains directories, file metadata, and block locations.
- Data is typically replicated to three nodes for high durability and availability; the NameNode ensures the replication factor is maintained.
- HDFS combines compute resources with storage nodes, originally for the MapReduce programming model, allowing in-place data processing.
- While some Hadoop ecosystem tools are declining, HDFS remains widely used in legacy installations and as a key component of current big data engines like Amazon EMR, often running with Apache Spark.
Indexes, partitioning, and clustering are database optimization techniques that have evolved from traditional row-oriented indexing to columnar storage, enabling efficient analytical queries through data organization and pruning.
- Indexes provide a map of the table for particular fields and allow extremely fast lookup of individual records.
- Columnar databases store data by column, enabling faster scans for analytical queries by reading only relevant columns and achieving high compression ratios; while historically poor for joins, their performance has significantly improved.
- Partitioning divides a table into smaller subtables based on a field (e.g., date-based partitioning) to reduce the amount of data scanned for a query.
- Clustering organizes data within partitions by sorting it based on one or more fields, co-locating similar values to improve performance for filtering, sorting, and joining (see the sketch after this list).
- Snowflake's micro-partitioning is an advanced approach in which data is automatically clustered into small, immutable units (50-500 MB) based on repeated values across fields, allowing aggressive query pruning using metadata that describes each micro-partition's contents.
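As a hedged illustration of partitioning and clustering (referenced in the clustering bullet above), the DDL below uses BigQuery syntax submitted through the google-cloud-bigquery client; the dataset, table, and column names are hypothetical, and other warehouses expose similar options.

```python
# Date-partitioned, clustered table: filters on the partitioning column let the
# engine prune partitions, and clustering co-locates rows by user_id within each
# partition. Assumes google-cloud-bigquery and an authenticated project; names
# are illustrative only.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    CREATE TABLE IF NOT EXISTS analytics.events
    (
        event_ts  TIMESTAMP,
        user_id   STRING,
        action    STRING
    )
    PARTITION BY DATE(event_ts)   -- queries filtered on event date scan only matching partitions
    CLUSTER BY user_id            -- rows within each partition are co-located by user_id
""").result()

# A query such as
#   SELECT action FROM analytics.events
#   WHERE DATE(event_ts) = '2024-01-05' AND user_id = 'u-42'
# scans only the 2024-01-05 partition and benefits from clustering on user_id.
```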