In this article, I will share the resources and approach I used to pass the Databricks Certified Associate Developer for Apache Spark 3.0 certification.
First of all, note that when I took the exam (03/28/2021), Spark 3.1.1 had already been released, but the exam itself was based on Spark 3.0 (link).
The exam is available in Python or Scala. It does not test your knowledge of the language itself, only your knowledge of Apache Spark, so choose whichever language you are more comfortable with. I chose Python. The exam window is divided into two areas:
– Left: exam questions and answer options.
– Right: this area is in turn split into an upper and a lower part. The upper part shows a giant PDF with the official documentation (link), in which you CANNOT search; the option is simply missing. The lower part is a kind of notepad for your own notes. It is not a shell or anything like that, just a notepad.
My exam recommendations
1. Whether or not you already know Spark well, in my opinion the book Spark: The Definitive Guide (book) is a must-read. Today the book may seem too simple, and it was written for Spark 2.x rather than 3.0, but I still recommend it, since in practice the basics are essentially the same.
2. Very important: make sure you are comfortable with the PDF documentation provided. For example, if you need to look up what the withColumn method does, you should know that it is in the DataFrame section and not in the SparkSession section. Knowing how the PDF is structured matters a great deal, because you cannot search it.
3. The exam focuses mainly on the DataFrame API, so if you only know SQL and do not know how DataFrames work, do not take the exam yet, because you will fail.
4. You should know how the Spark architecture works and its hierarchy: Jobs, Stages, Tasks, Partitions, Accumulators, Workers, Driver, and so on.
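To make the hierarchy from recommendation 4 a little more concrete, here is a toy model in plain Python. It is not Spark code and not how Spark's scheduler is implemented. It only sketches two rules of thumb: a job is cut into stages at shuffle boundaries, and each stage runs one task per partition. The function plan_stages and the split of operations into "narrow" and "wide" are assumptions made for this illustration. The value 200 is the default of spark.sql.shuffle.partitions, and adaptive query execution, which can change the partition count after a shuffle, is ignored.

```python
from typing import List, Tuple

# Simplified classification, assumed for this sketch only.
NARROW_OPS = {"select", "filter", "withColumn"}  # no shuffle needed
WIDE_OPS = {"groupBy", "distinct"}               # require a shuffle


def plan_stages(
    ops: List[str],
    input_partitions: int,
    shuffle_partitions: int = 200,  # default of spark.sql.shuffle.partitions
) -> List[Tuple[List[str], int]]:
    """Split a chain of operations into (stage_ops, task_count) pairs."""
    stages: List[Tuple[List[str], int]] = []
    current_ops: List[str] = []
    partitions = input_partitions
    for op in ops:
        if op in WIDE_OPS:
            # A shuffle boundary closes the current stage and opens a new one.
            stages.append((current_ops, partitions))
            current_ops = [op]
            partitions = shuffle_partitions
        elif op in NARROW_OPS:
            # Narrow operations are pipelined inside the current stage.
            current_ops.append(op)
        else:
            raise ValueError(f"unknown operation: {op}")
    stages.append((current_ops, partitions))
    return stages


# One action triggers one job; here the job reads 8 input partitions.
job = plan_stages(["filter", "withColumn", "groupBy", "select"], input_partitions=8)
total_tasks = sum(tasks for _, tasks in job)

for number, (stage_ops, tasks) in enumerate(job):
    print(f"stage {number}: {stage_ops} -> {tasks} tasks")
print(f"total tasks: {total_tasks}")
```

In this model the single groupBy produces two stages: the first runs 8 tasks, one per input partition, and the second runs 200 tasks, one per shuffle partition. The real Spark UI shows the same kind of breakdown per job, although the actual numbers depend on the data source and configuration.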
There are many resources available on the Internet on this topic, but in this article I would like to share a few links where I think there is great content about Spark.
1. https://www.linkedin.com/company/justenough-spark/ … Some of their quiz posts are very similar to the exam questions, and the content is often very good.
3. On YouTube, I recommend Bryan Cafferky's channel, where one of the playlists covers Databricks and Apache Spark. At the time of writing the video series is not yet complete, but the videos available so far are very helpful. Beyond this playlist, his channel is a good source of content in general.
4. Another useful resource is Bartosz Konieczny's website, which is almost entirely devoted to data engineering content of varying complexity.
I hope this article helps you better understand how best to approach your certification preparation. In case of failure, remember that this is not a failure at all, but an opportunity to find out where you should focus more attention and what to learn a little more.
Update from 03/04/2021:
I want to share a few more tips that came to mind after several days of thinking about how to help people get certified.
5. Tuning memory in Spark. You should have a basic understanding of how memory works in Spark. The topic is quite broad, but you need at least minimal knowledge of it, and everything you find at the following link should sound familiar to you. If you want to dig deeper, go ahead! But at the very least, you should understand the basic concepts.
6. Regarding the documentation provided during the exam (PDF): as I mentioned earlier, there is no need to memorize every function, you just need to know which section each one is in (SparkSession, DataFrame, functions, Row, and so on). Also keep in mind that links in the PDF are disabled. For example, the links that appear under the to_date function in the online documentation do not work in the exam PDF.
7. Read each question slowly. Some questions are not phrased directly, so you need to read carefully to understand exactly what is being asked. Some are trick questions. For example, one of them says: “return a new dataframe which has a column with the average salary” (avg…). You have to be very careful here, because the withColumn method ADDS another column to the original DataFrame, while the question asks for a single-column DataFrame. When they want you to use withColumn, they usually hint at it like this: “in addition to the current columns, we want a new column with xxxxx”. Reading this now, you may say, “Colleague, this is very simple, I would never have fallen for this,” but during the exam you are nervous and in a hurry to finish, so pay attention to what you are being asked.
Material prepared as part of the course Spark Developer.