This article is an excerpt from Course Report’s interview with our instructors Jonathan Frazier and Haythem Balti, two experts in data engineering. Check out the full interview on Course Report’s blog.
What is the difference between data science and data engineering?
Haythem: When you’re looking at a company’s data, the data is usually spread out across many servers and architectures. So in order to actually access that data, combine it, and analyze it, there are two different processes – data engineering and data science.
- Data engineering makes sure we can access and organize the data, which facilitates analysis.
- Data science focuses on analyzing this data.
A data engineer also has different skills than a data scientist.
- A data engineer is more like a software developer, and is sometimes called a data developer. Data engineers apply software development principles to organize data in the most efficient way, so a data scientist can use it to improve business intelligence.
- Data scientists usually focus on analysis which includes techniques like machine learning, text mining, and artificial intelligence.
Usually, good data science starts with good data engineering. For example, let’s say we want to estimate the number of items sold on Amazon in the next month. We would gather the sales data from the last five years – a data engineer would use their skills to pull all the sales data into one single data source. Then a data scientist can access that data and analyze it.
Jonathan: To give another analogy for this, imagine we’re trying to bake a cake. The data engineer’s job is to get all of the right ingredients that you need to bake your cake, prepare them for baking, and get them in the right shape to do the work.
The data scientist actually bakes the cake. There are all sorts of different cooking techniques that you can use with your data to put it together in interesting ways and end up with a final product that has value.
What do companies’ data engineering programs look like with The Software Guild?
Haythem: The first time we taught the data engineering course, we taught it face to face at the client’s office. For another iteration, we taught it online because our client’s employees were spread across different offices. There are a lot of online tools that we leverage that allow us to teach much as we would in a traditional classroom.
Jonathan: One of the features of the Software Guild is the flexibility. We’ve taught in-person classes on location at company offices. We also have our own campuses in a few different locations around the country, where teams can come and host their classes. Ten students is recommended, but we handle classes of up to around 20.
Haythem: The length of the courses is determined on a case-by-case basis. The data engineering program is two six-week tracks, but we try to work around what is best for the employees, to create a balance between work time and training time.
What does the data engineering curriculum include?
Haythem: The data engineering curriculum and all the projects revolve around three main steps: Extract, Transform, Load (ETL). Every time you analyze data, you need to extract the data, then transform it into a format that you can use for analysis. Finally, you take the results and load them into a database or a file.
We teach students how to use Python and other tools to access data, and to read data from SQL databases such as MySQL and NoSQL databases such as MongoDB. We also cover distributed computing frameworks such as Hadoop or Spark, and make sure students know how to read, save, analyze, and transform data. At the end of the six weeks, students showcase a project they worked on as a team.
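As a rough illustration of the Extract, Transform, Load steps described above, here is a minimal Python sketch. It is not taken from the Software Guild curriculum: the sample sales data, the function names, and the `monthly_sales` table are all hypothetical, and an in-memory SQLite database stands in for whatever database a real project would load into. It uses only Python's standard `csv` and `sqlite3` modules.

```python
import csv
import io
import sqlite3

# Hypothetical raw input. In practice this might come from files, APIs,
# or several different databases.
RAW_CSV = """order_id,item,quantity,sold_on
1,book,2,2023-01-05
2,lamp,,2023-01-06
3,book,1,2023-02-11
"""


def extract(text):
    """Extract: read the raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: skip rows with a missing quantity, then total units per month."""
    totals = {}
    for row in rows:
        if not row["quantity"]:
            continue  # incomplete record, leave it out of the totals
        month = row["sold_on"][:7]  # "YYYY-MM"
        totals[month] = totals.get(month, 0) + int(row["quantity"])
    return sorted(totals.items())


def load(monthly_totals, conn):
    """Load: write the transformed results into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monthly_sales "
        "(month TEXT PRIMARY KEY, units INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO monthly_sales VALUES (?, ?)", monthly_totals
    )
    conn.commit()


def run_pipeline(text=RAW_CSV):
    """Run extract, transform, and load in order, then read the table back."""
    conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
    load(transform(extract(text)), conn)
    return conn.execute(
        "SELECT month, units FROM monthly_sales ORDER BY month"
    ).fetchall()


if __name__ == "__main__":
    print(run_pipeline())
```

Keeping each step in its own function means the source (a CSV string here) or the destination (SQLite here) could be swapped out without touching the other steps.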
Jonathan: Something really important is the fact that instruction is focused around individual work and group work because both are a very important part of software development and data engineering. Many of the complex problems that we work on require effort from more than one person. So it’s really important to teach skills for working effectively in teams.
Not all employees are already strong in software engineering concepts. So one of the benefits of working with the Software Guild is that we have a lot of experience training people who essentially don’t have a technical background at all. We can guide employees from low technical skills through the basics, then through more advanced concepts like data engineering and into data science, where you get to really leverage some of the powerful data analytics tools that exist right now.
What advice do you have for companies who are thinking about retraining their employees in data engineering?
Haythem: My advice is to focus on identifying the skill gaps. You have to take the time to make sure you are addressing the right skill gaps and implementing the right training programs. That means establishing the objectives of the training and what tools, concepts, and curriculum should be covered in that training. So spend more time talking to your employees about what would make their jobs more efficient. Once you have that information, writing and delivering the content will be a relatively easy task.
Jonathan: Building on that, my advice is to find the pain point that your company is dealing with when it comes to data, and to understand how far employees can go, even with just basic software development skills. You can actually do a lot once you’ve learned the basics. So getting started and learning that foundation is really important, and it is never wasted effort.