Data Engineers are responsible for building and maintaining the data processing pipelines that data scientists use to build their models, and for deploying those models into production. Data engineering is therefore a versatile, interdisciplinary field that lies at the intersection of Software Engineering, Data Science and DevOps. To find a job as a Data Engineer, it helps to have some understanding of each of these areas.
To give you an idea of what skills employers look for, we went through several job listings for data engineering positions and compiled a list of skills that are important in the field. You do not need to know all of these technologies, and this is by no means an exhaustive list, but it should give you some orientation in the field.
Programming languages
Python is the most widespread programming language for working with data, including data engineering. It has a huge ecosystem of libraries for data analysis, which makes it arguably the best language for the job. Other popular languages in the field are Scala, Java and R.
Big Data frameworks
As a Data Engineer, you would be expected to handle and store large quantities of data. These days, several big data frameworks are used to process data in streaming or batch fashion on top of clusters of machines. The most popular frameworks for this task are Spark, Hadoop, Kafka and Flink. Getting your hands dirty with one or more of these frameworks is a good idea while looking for a Data Engineer job.
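Frameworks like Spark and Flink distribute this kind of work across a cluster, but the underlying batch-processing model — transform records, group them, aggregate — can be sketched in plain Python. This is a toy illustration of the pattern, not how you would process data at real scale:

```python
from collections import defaultdict

# Toy batch job: count events per user. In Spark this would be a
# groupBy + count over partitions of a much larger dataset.
events = [
    {"user": "alice", "action": "click"},
    {"user": "bob", "action": "view"},
    {"user": "alice", "action": "view"},
]

def count_events_per_user(events):
    counts = defaultdict(int)
    for event in events:
        counts[event["user"]] += 1
    return dict(counts)

print(count_events_per_user(events))  # {'alice': 2, 'bob': 1}
```

The value of a framework like Spark is that the same logical operations run unchanged whether the data fits in memory or spans a cluster.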
Infrastructure and APIs
Data Engineers are expected to maintain infrastructure and make data available via APIs so that data scientists can develop and run their models. The infrastructure typically runs on a cloud provider such as AWS, GCP or Azure, and its setup and maintenance are done with technologies such as Docker, Kubernetes and Terraform. For building and maintaining data pipelines, Airflow is typically used. It is therefore a good idea to develop some skills with one or more of these tools.
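Airflow models a pipeline as a DAG of tasks and runs each task only after its upstream dependencies have succeeded. The core idea — ordering tasks by their dependencies — can be sketched with Python's standard library. This is a simplified illustration with a made-up extract/transform/validate/load pipeline; real Airflow DAGs are declared with Airflow's own operators:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline: transform and validate both depend on
# extract, and load must wait for both of them.
dependencies = {
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}

# Produce an execution order that respects every dependency:
# 'extract' comes first, 'load' comes last.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

An orchestrator like Airflow adds scheduling, retries and monitoring on top of exactly this kind of dependency resolution.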
Databases
Relational databases and NoSQL databases are the two major types of databases. Each has its own merits and shortcomings, and depending on the use case and requirements you might choose either type. Among relational databases, PostgreSQL and MySQL are the most commonly used, while Elasticsearch, DynamoDB, MongoDB and CouchDB are popular NoSQL options. Knowing at least one database well is definitely a good idea; knowing one from each of the two categories is even better.
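For the relational side, you can practice SQL without installing anything: Python's standard library ships with SQLite, and the same queries carry over to PostgreSQL or MySQL. The schema and data below are made up for illustration:

```python
import sqlite3

# Throwaway in-memory database for experimenting with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO users (name, country) VALUES (?, ?)",
    [("alice", "DE"), ("bob", "US"), ("carol", "DE")],
)

# A typical relational query: group and aggregate.
rows = conn.execute(
    "SELECT country, COUNT(*) FROM users GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('DE', 2), ('US', 1)]
conn.close()
```

Parameterized queries (the `?` placeholders) are worth making a habit early, since they prevent SQL injection in production code.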
Data science libraries
As a Data Engineer, you are expected to have at least some basic knowledge of gathering and pre-processing data so that it can be used by the Data Science team. Tools such as NumPy, Pandas, Matplotlib and Jupyter are the standard choices for the job.
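A typical handoff might look like the following: load raw records into a pandas DataFrame, drop incomplete rows, and compute a summary the data scientists can start from. The column names here are made up for illustration, and the sketch assumes NumPy and pandas are installed:

```python
import numpy as np
import pandas as pd

# Raw records with a missing value, as they might arrive upstream.
raw = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "purchase_amount": [9.99, np.nan, 24.50, 5.00],
})

# Drop rows where the amount is missing, then summarize.
clean = raw.dropna(subset=["purchase_amount"])
avg = clean["purchase_amount"].mean()
print(f"{len(clean)} clean rows, average purchase {avg:.2f}")
```

In a real pipeline the cleaning rules (drop, impute, flag) would be agreed on with the data scientists, since they change what the models see.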