A VLOOKUP alternative to pulling Excel data using the Python Pandas Module

The VLOOKUP function in Excel is used to pull data from a row knowing at least one piece of information from a column about that row, however it has its limits. For instance, you cannot pull data using multiple criteria. Additionally, if you have a list of items within your Excel dataset that may contain multiple worksheets you want to pull from, you will have to use Visual Basic and create a sophisticated loop.

To deal with these issues, I decided to expand my on Python background and created a visual tool to not only be able to pull data from an Excel workbook by knowing one piece of information from a column, but even more specifically, by two pieces of information.

The tool works by uploading the Excel workbook you want to extract data from. This populates Column filter option #1 and Column filter option #2 dropdown box, and the Select columns to extract listbox with the column headings found within the workbook. This includes all worksheet column headings. In order for this to work, the column headings must be located in the first row in each sheet, otherwise they will not be read. You then upload another Excel workbook with the first worksheet in column 1 and/or column 2 containing data you want to filter your main data-containing Excel table by. After pushing Submit, this then creates a new table with all of the information you selected to pull for each item which you can save as a separate Excel table.

Currently it works only as a categorical filter, you cannot use it to filter by a conditional where if you want to extract specific columns of data from a row greater than or less than a set value. At that point, it’s better to create your new Excel data table with this tool, and then use the filter feature in Excel to narrow your table by conditional. Although in the future, I may update the code to include more advanced options.

How to Use

First, you’ll need a few modules to get it working. In your terminal, use pip to install:

  1. pandas
  2. PySimpleGUI
  3. numpy

Then, create a new Python file, data_extractor_tool.py or whatever you want to call it. Finally, copy and paste the source code below and run the code.\ To use this tool, click the Browse button under Excel data workbook. Then locate your Excel workbook containing data to extract. The workbook that I’m using in this example is Accidental Drug Related Deaths 2012-2018.

Once found, this should populate Column filter option 1 and Column filter option 2 dropdown boxes, and Select columns to extract list box.

Next, under Excel data filter, upload the Excel worksheet you want to filter your dataset by. I have one called drug_death_filter.xlsx. In this table, only columns A and B, can have data in them. If you have data in another column, it will not be used as a filter. These correspond to Column filter option 1 and Column filter option 2, respectively.

In my scenario, I am filtering by Race, and a general category Y, meaning Yes.

Depending on how many filter options I choose for my data (1 and/or 2) and how many columns I select in the Select columns to Extract listbox, I will get a different Excel table. For example, if I choose to filter just by Race alone, I will get just all Whites. This is how you do that.

In Column Filter option 1, select Race, in Column Filter option 2 leave it blank (either delete or use the drop down and find the blank space), and in the Select Columns to Extract, select all the other information to extract including Race. If you do not include Race in Select columns to extract, then you’ll get a popup that will tell you there is no data.

The following image is my output.

This table contains all whites that died from drug related deaths. To save the data, push Save As… which will create a new Excel table.

To limit my criteria to find all whites that died from cocaine, I use Column filter option 1 and Column filter option 2. In Column filter option 1 Select Race, in Column filter option 2, Select Cocaine, and in Select columns to Extract, choose Race and Cocaine.

This table has all the whites who died from cocaine.

You can also just use Column filter option 2 to filter the data as well with generic Y responses. Just make sure that the Column filter option 2 and Select columns to Extract have Y responses as well.

And the output for requested data

So this found all the values of people who died from cocaine.

In this example, I used a dataset with only one worksheet and a data filter with only one item to filter by (White and generic Y). However, this program will work with multiple data sheets and multiple items. If you had, for example, a list of part numbers in a column filter option 1, you can find ALL part number information in the entire dataset.

Threshold Button

The threshold button only shows columns that have data greater than the value specified, this clears up any columns that do not have much information in them. Below is an image of what happens when the threshold is too large and there is not enough data in the columns.

By the way, the code is open source so feel free to use, or modify it for your own purposes.

Source Code:

#=============================================================================
# Functions
#    drop_dup(x)
#        description: Drops duplicate values from logColumnList
#        x: List object as argument
#    save_file(fileName)
#        fileName: String argument
#    update_threshold(threshold)
#        description: Adds elements to display in list, combo boxes, based
#        on threshold value.
#        threshold: Integer argument
#    append_column_list()
#        description: Updates the column list and returns updateList 
#        containing sheets as workSheetStore and columns as logColumnList.
# Attributes
#    workSheetStore: List object that holds all worksheets.
#    logColumnList: List object that holds all columns from worksheets.
#=============================================================================

def drop_dup(x):
    return list(dict.fromkeys(x))

def save_file(FILENAME):
    SAVEEXCEL.Element("_FILENAME_").Update(FILENAME)
    MOUNTFRAME.to_excel(FILENAME, 
                        index=False)
    gui.Popup("Your parts list file was created:","{}".format(FILENAME), 
               title="Success!")
    SAVEEXCEL.Close()
    SUBMITACTIVE = False
    return SUBMITACTIVE

def update_threshold(threshold):
    LOGCOLUMNLIST = []
    DLOGNEW = pd.DataFrame()
    DATALOGDFS = pd.read_excel(values["_INPUTTXT1_"],
                               sheet_name=None,
                               usecols=None,
                               na_values="NaT")

    WORKSHEETSTORE = [sheets 
                      for sheets 
                      in DATALOGDFS 
                      if not "Sheet" 
                      in sheets]

    for sheet in WORKSHEETSTORE:
        DLOGNEW = DLOGNEW.append(DATALOGDFS[sheet].dropna(axis=1, 
                                                          thresh=threshold), 
                                 ignore_index=True, 
                                 sort=True)
        for columnvalues in DLOGNEW:
            LOGCOLUMNLIST.append(columnvalues)

    LOGCOLUMNLIST = drop_dup(LOGCOLUMNLIST)
    
    return LOGCOLUMNLIST, WORKSHEETSTORE

def append_column_list():
    try:
        UPDATELIST = update_threshold(int(values["_THRESHOLD_"]))
        MAIN.Element("_COLLIST_").Update(values=UPDATELIST[0])
        UPDATELIST[0].append('') # Appends to logColumnList stored as element [0].
        MAIN.Element("_COMBO_").Update(values=UPDATELIST[0])
        MAIN.Element("_COMBO2_").Update(values=UPDATELIST[0])  
        
    except ValueError:
        UPDATELIST = ""
        MAIN.Element("_THRESHOLD_").Update("")
        gui.Popup("Threshold value must be an integer.",
                  title="Error")
    return UPDATELIST

def save_excel_ui():
    SAVEEXCELLAYOUT = [[gui.Text("This is what your excel data file will look like:")],
                       [gui.Multiline(MOUNTFRAME, 
                                      size=[35,20],
                                      auto_size_text=True)],
                       [gui.Input(do_not_clear=False, 
                                  key="_FILENAME_", 
                                  enable_events=True, 
                                  visible=True,
                                  disabled=True)],
                       [gui.FileSaveAs(file_types=(("Excel Files (*.xls*)", "*.xls*"),), 
                                       enable_events=True, 
                                       target="_FILENAME_"), 
                        gui.Cancel(key="_CANCEL_")]
                      ]
   
    SAVEEXCEL = gui.Window("Save Output Data to Excel", 
                                layout=SAVEEXCELLAYOUT,
                                resizable=True)
    return SAVEEXCEL

def file_name_filter(FNAMES, SUBSTR):
    return [str 
            for str in FNAMES 
            if any(sub in str for sub in SUBSTR)]

#=============================================================================
# Main GUI layout and Settings
#=============================================================================
    
gui.SetOptions(background_color="white",
               scrollbar_color="white",
               input_elements_background_color="white",
               element_background_color='white', 
               button_color=('#ff1a1a','black'))


FACETCOL = [[gui.T("Image Folder:"),
             gui.InputText(size=(22,1),
                           key="_INPUTTXT3_"),
             gui.FolderBrowse(target="_INPUTTXT3_")]
           ]

LAYOUT = [[gui.Graph(canvas_size=(475,43),
                     key="_LOGO_",
                     graph_bottom_left=(0,0),
                     graph_top_right=(295,50),
                     background_color="white")],
          [gui.Text("Excel data workbook:",
                    size=(22,1)),
           gui.InputText(size=(25,1),
                         key="_INPUTTXT1_",
                         enable_events=True),
           gui.FileBrowse(target="_INPUTTXT1_",
                          file_types=(("Excel Files (*.xls*)", "*.xls*"),)),],
          [gui.Text("Column filter option 1",
                    size=(22,1)),
           gui.InputCombo(values=[],
                          size=[15,1],
                          enable_events=True,
                          key="_COMBO_")],
          [gui.Text("Column filter option 2",
                    size=(22,1)),
           gui.InputCombo(values=[],
                          size=[15,1],
                          enable_events=True,
                          key="_COMBO2_")],
          [gui.Text("Select columns to extract:",
                    size=(22,1)),
           gui.Listbox(values=[],
                       size=[15,5],
                       select_mode="multiple",
                       enable_events=True,
                       key=("_COLLIST_")),
           gui.Text("Threshold:"),
           gui.Input(default_text="30",
                     size=(4,1),
                     key="_THRESHOLD_",
                     enable_events=True),
           gui.Button(u"\u27f3",
                      font=["Arial", 10, "bold"],
                      key="_THRESHB_")],
          [gui.Text("Excel data filter:",
                    size=(22,1)),
           gui.InputText(size=(25,1),
                         key="_INPUTTXT2_"),
           gui.FileBrowse(target="_INPUTTXT2_",
                          file_types=(("Excel Files (*.xls*)", "*.xls*"),),
                          key="_PARTSBROWSE_")],
          [gui.Button("Submit",
                      key="_SUBMIT_"),
           gui.Exit(),
           gui.Text("",
                    size=(37,1)),
           gui.Text("Written by: \nO. Semenchenko",
                    font=["Arial", 6]),]
         ]

MAIN = gui.Window(title="Excel Data Extractor Tool v1.0",
                  layout=LAYOUT).Finalize()

# Add logo
# MAIN.Element("_LOGO_").DrawImage(filename="YOUR_LOGO.png", 
#                                      location=(0,50),)
#=============================================================================
# Main window loop event
# 
# Definitions
#
# Variables:
#    submitActive: Boolean to continue Excel save file window event.
#    start: Begins count of for loop of parts list at 1.
#    dataList: List object stores information retrieved from rows.
#    logList: List object stores index values.
#    LoopStop: Boolean that breaks index iteration loop if no items selected 
#    in combobox Column filter option.
#=============================================================================

while True:
    event, values = MAIN.Read()  
    
    if event in (None, "Exit"):
        break

    if event == "_INPUTTXT1_":
        try:
            UPDATELIST = append_column_list()
                
        except FileNotFoundError:
            pass
    
    if event == "_THRESHB_":
        try:
            UPDATELIST = append_column_list()
        
        except ValueError:
            if type(values["_THRESHOLD_"]) == str or type(values["_THRESHOLD_"]) == float:
                gui.Popup("Threshold value must be an integer.",
                          title="Error")
                
        except FileNotFoundError:
            pass

    if event == "_SUBMIT_":
        try:
            SUBMITACTIVE = True
            LOOPSTOP = False
            DATALIST = []
            LOGLIST = []
       
# Retrieve data from the values dictionary
            DATALOG_PATH, PARTSLIST_PATH = values["_INPUTTXT1_"], values["_INPUTTXT2_"]       
            
# Use column 0 to extract from parts list
            PARTSLISTDF = pd.read_excel(PARTSLIST_PATH, 
                                        usecols=[0,1], 
                                        header=None, 
                                        names=["0","1"],).drop_duplicates()

# Load all sheets, extract from columns selected in list box.
            for sheet in UPDATELIST[1]:
                try:
                    DATALOGDF = pd.read_excel(DATALOG_PATH,
                                              sheet_name=[sheet],
                                              usecols=values["_COLLIST_"])
                    for key in DATALOGDF:
                        LOGLIST.append(DATALOGDF[key].dropna(how="all"))
# Remove empty dataframes in the index of logList
                        for dataframe in LOGLIST:
                            if dataframe.empty:
                                LOGLIST.pop()
                                    
                except ValueError:
                    pass

            for START in range(len(PARTSLISTDF)):
                if LOOPSTOP == True:
                    break
                PARTSLISTCOL = [PARTSLISTDF.loc[START, "0"], PARTSLISTDF.loc[START, "1"]]
                for sheet in range(len(LOGLIST)):
                    if LOOPSTOP == True:
                        break
                    try:
                        if (values["_COMBO_"], values["_COMBO2_"]):
                            INDEXRETURN = LOGLIST[sheet][(LOGLIST[sheet][values["_COMBO_"]] == PARTSLISTCOL[0]) & (LOGLIST[sheet][values["_COMBO2_"]] ==PARTSLISTCOL[1])]
                            
                    except KeyError:
                        try:
                            if (values["_COMBO_"]):
                                INDEXRETURN = LOGLIST[sheet][(LOGLIST[sheet][values["_COMBO_"]] == PARTSLISTCOL[0])]
                            elif (values["_COMBO2_"]):
                                INDEXRETURN = LOGLIST[sheet][(LOGLIST[sheet][values["_COMBO2_"]] == PARTSLISTCOL[1])]
                            else: 
                                LOOPSTOP = True
                                gui.Popup("At least one item must be selected as a column filter option.",
                                          title="Error")
                        except KeyError:
                            pass
                    try:
                        if not INDEXRETURN.empty:
                           DINDEX = INDEXRETURN.index.values
                           for x in np.nditer(DINDEX):
                               SINDEX = LOGLIST[sheet].loc[int(x)]
                               DATALIST.append(SINDEX)
                    except NameError:
                        # Exception for when INDEXRETURN is not defined.
                        pass

            MOUNTFRAME = pd.DataFrame(DATALIST)
                
            if LOOPSTOP == True:
                LOOPSTOP = False
    
            if MOUNTFRAME.empty:
                SUBMITACTIVE = False
                MAIN.Refresh()
                gui.Popup("No data found.",
                          title="Update")
            else:
                SAVEEXCEL = save_excel_ui()
        
#=============================================================================
#  Save Excel file window loop event           
#=============================================================================        
                
            while SUBMITACTIVE:
                event2, values2 = SAVEEXCEL.Read()
    
                if event2 in (None, "_CANCEL_"):
                    SUBMITACTIVE = False
                    SAVEEXCEL.Close()
                    MAIN.Refresh() 
                    
                if event2 == "_FILENAME_":
# Verifies filename was specified at save
                    if not values2["_FILENAME_"]: 
                        SAVEEXCEL.Refresh()      
                    else:
                        try:
                            FILENAME = values2["_FILENAME_"]        
                            SUBMITACTIVE = save_file(FILENAME)
# If file extension does not exist, add to file name.
                        except ValueError: 
                            FILENAME = values2["_FILENAME_"] + '.xlsx'    
                            SUBMITACTIVE = save_file(FILENAME)

#=============================================================================
# Window close and Submit event exception                        
#=============================================================================
                            
        except FileNotFoundError:
            MAIN.Refresh()
MAIN.Close()

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *