Tutorial: Python-Script to find and remove duplicates

cdb · October 10, 2022, 9:05pm

Do you want an automation that finds and removes duplicated rows in one of your bases? Then try this python-code.

This code loops through all the rows in your base. It finds duplicates and leaves only the newest or the oldest entries. You can decide which rows should be kept.

You can either run this script manually or create an automation (Enterprise only) that is executed every time a row is updated.

And here is the code that should be self-explanatory:

# This python script searches for similar rows and deletes these rows
# if you want to compare multiple columns, then add a *formula column* and concatenate the different columns
# here is an example for such a formula : {ColumnA}&"_"&{ColumnB}&"_"&{ColumnC}

# INPUT VARIABLES (only change these three values!):
table_name = 'Table1'
view_name = 'Default View'
column_to_compare = 'the-name-of-the-column'

# DO NOT CHANGE ANYTHING BELOW THIS LINE !
from seatable_api import Base, context

server_url = context.server_url
api_token = context.api_token

def remove_duplicates():
    print("start to search for duplicate rows ...")

    # AUTH
    base = Base(api_token, server_url)
    base.auth()
    
    # TEMPORARY VARIABLE TO STORE ROW VALUES FOR COMPARISON
    previous_column = ""
    dfound = 0

    # keep the oldest entry:
    #rows = base.query('select _id,_ctime,'+column_to_compare +' from '+ table_name +' order by '+ column_to_compare +', _ctime')
    # keep the newest entry:
    rows = base.query('select _id,_ctime,'+column_to_compare +' from '+ table_name +' order by '+ column_to_compare +', _ctime DESC')

    for row in rows:
      # DEBUG:
      #print(row)
      #print(row.get('Title'))

      # COMPARE THIS ROW WITH THE PREVIOUS ONE:
      if row.get(column_to_compare) == previous_column:
        print("Duplicate found at row with the id: " + row.get('_id') + ". Delete this row ...")
        previous_column = row.get(column_to_compare)
        base.delete_row(table_name, row.get('_id'))
        dfound = dfound + 1
      else:
        previous_column = row.get(column_to_compare)

    print(str(dfound) +" duplicates found and deleted")

remove_duplicates()

webdienste · October 11, 2022, 8:17am

This is very interesting as an automation. What would be even more useful for me is if it were possible to not only find duplicates and delete them but to compare old and new entries and update the old entries with the new information (specifically, in my case, updating by adding the new information to the old previously existing row and then deleting the new row).
The reason is that I have a table containing people who have registered for courses in the past. Via Integromat, when they register for a new course, a new row is added to this table. i then have to manually go through, find the duplicates and then add a link to the course in the first entry, then delete this new entry as it is no longer needed. The aim is to have one row per person with a column linking to all courses the have ever registered for.

daniel.pan · October 11, 2022, 11:28am

This is an deduplication plugin:

webdienste · October 11, 2022, 3:14pm

Hi Daniel,

That is really interesting. I am on SeaTable cloud. I see that Deduplication is on Github. How can I add the deduplication plugin to SeaTable?

rdb · October 11, 2022, 3:41pm

Open a base
Hit “Plugins” in the top right corner
Hit “Add plugin”
Choose the deduplication plugin

webdienste · October 12, 2022, 7:53am

Oh, I see, this is the “duplicates” plugin. It was already installed.
The problem here is that I only have the option to delete duplicates, whereas I don’t want to lose information from the new rows but merge the different items of information from one or more columns into one row.
Shall I add this to the ideas / feature requests forum or is there already a python script or formula that can do this??

cdb · October 12, 2022, 8:36am

I doubt that this should be added to the existing deduplication plugin. Every user will have a slightly different requirement, and therefore an automation is the way to go…

Are you experienced enough to extend the python script by yourself? Otherwise, we could offer for you to extend the script to your needs as a service. Please make a concrete example with data, table structure and what should happen. Then we can talk in more details.

webdienste · October 12, 2022, 9:27am

Thanks Christoph,

I have some programming experience (mostly in Javascript) but not in Python. I have been looking around for different ways to achieve the task I describe (and I have posted on this a number of times) but will give some thought to the service idea.

In the meantime, thank you so much for all your work and also your help in the forum!

Tomtable · December 26, 2022, 9:31am

Thanks for this thread and question.

Like the original post, I would like an automation to update a row.

I have a table of people’s contact details. If the same person fills in a form to join, that will create a new row.

Ideally, the automation would look up previous rows and identify a duplicate and add any details that were previously blank fields, and update any fields with new info provided. Once all the data was updated, the new row would be removed.

I have no programming ability (though I plan to learn in 2023). Does this automation exist already in some form?

rdb · December 26, 2022, 11:27am

Unfortunately, there is no merge automation at this point in time available in SeaTable.

To accomplish a merge, you’d have to write a custom script or extend the script above.

ivan07 · December 18, 2023, 12:50pm

Hi, is it possible somehow to configure the Deduplication plugin to not distinct between low and uppercase?

cdb:

# This python script searches for similar rows and deletes these rows
# if you want to compare multiple columns, then add a *formula column* and concatenate the different columns
# here is an example for such a formula : {ColumnA}&"_"&{ColumnB}&"_"&{ColumnC}

# INPUT VARIABLES (only change these three values!):
table_name = 'Table1'
view_name = 'Default View'
column_to_compare = 'the-name-of-the-column'

# DO NOT CHANGE ANYTHING BELOW THIS LINE !
from seatable_api import Base, context

server_url = context.server_url
api_token = context.api_token

def remove_duplicates():
    print("start to search for duplicate rows ...")

    # AUTH
    base = Base(api_token, server_url)
    base.auth()
    
    # TEMPORARY VARIABLE TO STORE ROW VALUES FOR COMPARISON
    previous_column = ""
    dfound = 0

    # keep the oldest entry:
    #rows = base.query('select _id,_ctime,'+column_to_compare +' from '+ table_name +' order by '+ column_to_compare +', _ctime')
    # keep the newest entry:
    rows = base.query('select _id,_ctime,'+column_to_compare +' from '+ table_name +' order by '+ column_to_compare +', _ctime DESC')

    for row in rows:
      # DEBUG:
      #print(row)
      #print(row.get('Title'))

      # COMPARE THIS ROW WITH THE PREVIOUS ONE:
      if row.get(column_to_compare) == previous_column:
        print("Duplicate found at row with the id: " + row.get('_id') + ". Delete this row ...")
        previous_column = row.get(column_to_compare)
        base.delete_row(table_name, row.get('_id'))
        dfound = dfound + 1
      else:
        previous_column = row.get(column_to_compare)

    print(str(dfound) +" duplicates found and deleted")

remove_duplicates()

cdb · December 18, 2023, 3:48pm

In Python, you can transform a string to lowercases with .lower().

Therefore you have to do this to make the script case insensitive.

if row.get(column_to_compare).lower == previous_column.lower:

cdb · December 28, 2023, 10:50pm

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.