Extracting Matching Words Using A Pre-Defined List in Power Query/M
Solution to extract words from a sentence in each row using a pre-defined list in Power Query
Problem
A colleague reached out to me with an interesting problem. He wanted to extract certain error codes/phrases from sentences. He had a list of all possible codes and phrases and wanted it to be dynamic so that he can change the list of possible codes in the future. For example, imagine you survey 10 BI developers on which languages they use on a daily basis. Some might say Python, SQL, R, Excel etc. We have a list of all possible languages and we want to extract only the languages for that developer based on this list. If a language does not exist in the list, it should return a null value. See below what the final solution should look like:
Solution
I couldn't find a solution online so I created one using List functions in M. All of the steps below can be wrapped into one nice function, but for demonstration purposes I will do it step by step. If you use this solution, just convert it into a function as needed. Also to keep the blog short, I am going to assume that the words/phrases match exactly with those in the list. You will need to format them (uppercase/lowercase/remove punctuations etc.) based on your scenario.
We start with 2 columns. First column contains unique id for each respondent and the second column has their text responses. We also have a list of languages as another query.
Step 1 : Clean and Trim
Create a new column with the text trimmed and cleaned. You can go to Add column > Format > Trim > Clean
If their are any punctuations, you should remove those as well in this step. We need each of the words to be isolated. For example, if the sentence contains "SQL," it will be treated as a one word. Remove the comma to turn "SQL," into "SQL".
List.Intersect
with List.Buffer
Step 2 : Find Matching Words Using Now we need to lookup the pre-defined list to see if any of the words in the above list exist. The expected result is that if the list contains any of the pre-defined words, those should be returned otherwise we should get a null value. We can achieve this using List.Intersect()
. I am wrapping the list with List.Buffer()
to make lookup faster.
Notice below, the Person1 uses Python so the resulting list on the first row contains only "Python" and no other words.
Step 5 : Remove Intermediate Columns
Remove all the intermediate columns from steps 2-4 to keep only the resulting column with matching words! As I mentioned above, all these steps can be rolled into one nice function that can be used elsewhere as well.
Person2 had two languages Python & R so we get two lines, whereas person5 & 8 had no matching languages thus null.
Final M code:
You can download the .pbix from here. If you know a better, faster, more elegant solution, please let me know.