Problem

A colleague reached out to me with an interesting problem. He wanted to extract certain error codes/phrases from sentences. He had a list of all possible codes and phrases and wanted it to be dynamic so that he can change the list of possible codes in the future. For example, imagine you survey 10 BI developers on which languages they use on a daily basis. Some might say Python, SQL, R, Excel etc. We have a list of all possible languages and we want to extract only the languages for that developer based on this list. If a language does not exist in the list, it should return a null value. See below what the final solution should look like:

Solution

I couldn't find a solution online so I created one using List functions in M. All of the steps below can be wrapped into one nice function, but for demonstration purposes I will do it step by step. If you use this solution, just convert it into a function as needed. Also to keep the blog short, I am going to assume that the words/phrases match exactly with those in the list. You will need to format them (uppercase/lowercase/remove punctuations etc.) based on your scenario.

We start with 2 columns. First column contains unique id for each respondent and the second column has their text responses. We also have a list of languages as another query.

Step 1 : Clean and Trim

Create a new column with the text trimmed and cleaned. You can go to Add column > Format > Trim > Clean

If their are any punctuations, you should remove those as well in this step. We need each of the words to be isolated. For example, if the sentence contains "SQL," it will be treated as a one word. Remove the comma to turn "SQL," into "SQL".

Step 2 : Split the Text using Text.Split

In this step we split the setence on each row into separate words and store them as a list using Text.Split() function.

Notice after splitting we get a list on each row.

Step 2 : Find Matching Words Using List.Intersect with List.Buffer

Now we need to lookup the pre-defined list to see if any of the words in the above list exist. The expected result is that if the list contains any of the pre-defined words, those should be returned otherwise we should get a null value. We can achieve this using List.Intersect(). I am wrapping the list with List.Buffer() to make lookup faster.

Notice below, the Person1 uses Python so the resulting list on the first row contains only "Python" and no other words.

Step 3 : Convert List to Table Using Table.FromList

Convert the intersected list into a table using Table.FromList.

Step 4 : Expand the Table

Now we just need to expand the table to get each matching word on a separate row for each person. You can eaither use Table.ExpandTableColumn or just click on the two double arrows

After extracting :

Step 5 : Remove Intermediate Columns

Remove all the intermediate columns from steps 2-4 to keep only the resulting column with matching words! As I mentioned above, all these steps can be rolled into one nice function that can be used elsewhere as well.

Person2 had two languages Python & R so we get two lines, whereas person5 & 8 had no matching languages thus null.

Final M code:

You can download the .pbix from here. If you know a better, faster, more elegant solution, please let me know.