When thinking of the modern data industry, affectionately called “Big Data,” it’s easy to lump all “data people” into the catch-all term “data scientist.” But the reality is that there are many related disciplines necessary for handling Big Data problems at the enterprise level.
Setting aside for a moment database administrators (often referred to as DBAs), we’re still left with data analysts, data engineers, and data scientists. While human resources may not always know the actual difference between these related roles in the company, they are quite different in terms of day-to-day responsibilities and expertise.
Data analysts typically work in the “data warehousing” side of things with tools like Snowflake, Amazon Redshift, and Google BigQuery. Generally, they’re responsible for moving structured data that’s neatly organized in the systems of record into high-performance data warehouses and team-specific “data marts” to produce analytics and business intelligence (BI) reports.
In comparison, data engineers tend to be assigned to “data engineering” and “event streaming” projects. The role of a data engineer is conceptually similar to that of a data analyst, but the main difference is that a data engineer is more likely to specialize in handling semi-structured, unstructured, and streaming data (such as from real-time events) than a “pure” data analyst.
To handle data that may contain duplicate or incomplete records, a data engineer typically relies on tools such as Airflow, dbt, Fivetran, or Airbyte to extract, transform, and load (ETL) data. (In fact, many data engineers now prefer to load the data before transforming it, resulting in an ELT process.) These complex processes are often partially manual and can involve data lakes and streaming data engines, using software such as Apache Spark, Kafka, and Amazon Kinesis.
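To make the ETL versus ELT distinction concrete, here is a minimal sketch in Python. It uses only the standard library, with an in-memory SQLite database standing in for a warehouse. The records, table names, and function names are all made up for illustration; real pipelines would typically use the tools named above rather than hand-written code like this.

```python
import sqlite3

# Hypothetical raw records; field names are invented for this example.
RAW_EVENTS = [
    {"id": 1, "customer": "ana", "amount": 10.0},
    {"id": 1, "customer": "ana", "amount": 10.0},  # duplicate record
    {"id": 2, "customer": "ben", "amount": None},  # incomplete record
    {"id": 3, "customer": "cy", "amount": 7.5},
]


def transform(records):
    """Drop incomplete records, then drop exact duplicates."""
    seen = set()
    clean = []
    for rec in records:
        if any(value is None for value in rec.values()):
            continue
        key = (rec["id"], rec["customer"], rec["amount"])
        if key in seen:
            continue
        seen.add(key)
        clean.append(rec)
    return clean


def run_etl(raw):
    """ETL: clean the data in application code, then load the result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO events VALUES (:id, :customer, :amount)", transform(raw)
    )
    conn.commit()
    return conn


def run_elt(raw):
    """ELT: load the raw data as-is, then clean it with SQL inside the database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_events (id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO raw_events VALUES (:id, :customer, :amount)", raw)
    conn.execute(
        """
        CREATE TABLE events AS
        SELECT DISTINCT id, customer, amount
        FROM raw_events
        WHERE customer IS NOT NULL AND amount IS NOT NULL
        """
    )
    conn.commit()
    return conn


QUERY = "SELECT id, customer, amount FROM events ORDER BY id"
etl_rows = run_etl(RAW_EVENTS).execute(QUERY).fetchall()
elt_rows = run_elt(RAW_EVENTS).execute(QUERY).fetchall()
```

Both paths end with the same cleaned table. The difference is where the cleanup runs: in ETL it happens before the data reaches the database, while in ELT the raw data lands first and the database does the transformation.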
“Data science” and “machine learning” (ML) are the last two data-related disciplines that we’ll cover, and these projects tend to be completed by individuals with titles like “data scientist.” Like data engineers, data scientists are often accustomed to working with all types of data, so they may use the same data lakes and data preparation tools that data engineers use. However, data scientists generally transform their data with the ultimate goal of tackling data science or ML problems, while data engineers are stereotypically more interested in creating repeatable engineering processes to support other parts of their organizations.
Compared to data analysts, who may deal with a lot of one-off report generation for business intelligence and competitive analysis, data scientists tend to focus on drawing statistical conclusions (to prove or disprove a hypothesis) or on helping create ML apps (like ML-powered image recognition). That means data scientists often reach for software like scikit-learn, TensorFlow, or PyTorch for their data science and ML work. These frameworks tend to be more specialized to data science or ML workflows than related tools in data engineering, which may not be capable of supporting the selection, training, and evaluation of an ML model, for example.
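As a rough illustration of what “selection, training, and evaluation” means, the sketch below fits two candidate models on a training split, scores them on held-out data, and picks the better one. It deliberately uses only the Python standard library so that it stays self-contained; in practice a data scientist would more likely use one of the frameworks mentioned above. The data and all function names are invented for this example.

```python
from statistics import mean

# Made-up data: y is roughly 2 * x + 1 with a little noise.
XS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
YS = [1.1, 2.9, 5.2, 7.1, 8.8, 11.0, 13.1, 15.2, 16.9, 19.0]


def split(xs, ys, every=4):
    """Hold out every `every`-th point for evaluation; train on the rest."""
    pairs = list(zip(xs, ys))
    train = [p for i, p in enumerate(pairs) if i % every != 0]
    test = [p for i, p in enumerate(pairs) if i % every == 0]
    return train, test


def fit_mean(train):
    """Baseline model: always predict the average training target."""
    avg = mean(y for _, y in train)
    return lambda x: avg


def fit_line(train):
    """Simple linear regression fitted by ordinary least squares."""
    mx = mean(x for x, _ in train)
    my = mean(y for _, y in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def mse(model, data):
    """Evaluation metric: mean squared error on the given data."""
    return mean((model(x) - y) ** 2 for x, y in data)


def select_model(xs, ys):
    """Train each candidate, evaluate on held-out data, return the best."""
    train, test = split(xs, ys)
    candidates = {"mean": fit_mean(train), "line": fit_line(train)}
    scores = {name: mse(model, test) for name, model in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


best_name, scores = select_model(XS, YS)
```

On this toy data the linear model should win by a wide margin, since the data was constructed to follow a line. The point is the workflow, not the models: candidates are trained on one part of the data and compared on another.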
Meanwhile, data engineers will commonly take data from data warehouses, data marts, and analytical reports; transform that data into different formats; and then hand it off to data scientists or data analysts. They’re likely getting their hands dirty with programmatic setup and configuration as part of complex data engineering projects that can take months to complete. Building in-product analytics for a software as a service (SaaS) company is an example of a project that typically requires a team of data engineers. That type of project is a little less likely to involve data scientists, unless there’s a need for statistical analysis or ML-powered features.
We’ve seen that these three “Big Data” career paths are related and have a lot of overlap, but the main differences between data engineers, scientists, and analysts come down to two things: 1) the typical problems they’re trying to solve and 2) their choice of tools to do so.
A data analyst is most likely to be associated with “business intelligence” (BI) problems, meaning that they’ve been tasked with generating actionable BI for the company. While they often use data engineering tools and are probably comfortable setting up data warehouses, data analysts are usually the ones setting up team-specific analytics reports via data marts. They may be attached to teams of business analysts or to individual functions of an organization (like marketing), or they may report to executive management on a regular basis.
Meanwhile, a data engineer is someone who typically is a little less focused on BI reporting and who instead is responsible for cleaning up and processing complex data. They may use more “programmatic” approaches (like a software engineer) and are probably comfortable taking manual steps to extract, load, and transform (ELT) data. Data engineers are probably familiar with the difference between a data warehouse and a data lake, and they’re often involved in platform-level initiatives around event-driven architecture for real-time streaming analytics.
Last, but definitely not least, data scientists likely have more of a research background, at least by formal training and educational curriculum. Experts in machine learning (ML) and statistical analysis are much more likely to use the term data scientist, though there are many who have job titles as statisticians (statistical analysts), informaticians (information scientists), or ML engineers. Given that ML can theoretically be applied to a very wide range of problems, data scientists are in high demand as organizations try to optimize their businesses and deliver value to customers. But they aren’t usually the ones providing BI up the chain to the CEO.
While the job descriptions for each data discipline are far from set in stone, it’s useful to understand the similarities and differences between data science, data engineering, and data analytics.
Overall, there’s a continuum between statistical machine learning on one side – “pure” data science and ML – and one-off manual reporting to support executive decision making on the other – “pure” data analytics and BI. Data engineers are somewhere in the middle, and they are often deeply involved in software engineering and product architecture.
There are no hard and fast rules in Big Data, and data-related disciplines are changing faster than just about any other part of the technology space as the size of data continues to grow. If you’re not quite sure what someone’s experience is in data science, analytics, or engineering, just ask them about the types of projects they like to work on and the tools they prefer using.
You can also ask if they prefer specifics (like engineering event streaming software architecture) or if they are generally comfortable working with a wide variety of data-related projects. In the end, keep in mind that job titles in Big Data both mean a lot and nothing at the same time; they can be useful to deepen your understanding, but they shouldn’t be used to box someone in.