Tutorial: Python-Script to find and remove duplicates

Do you want an automation that finds and removes duplicated rows in one of your bases? Then try this python-code.

This code loops through all the rows in your base. It finds duplicates and leaves only the newest or the oldest entries. You can decide which rows should be kept.

You can either run this script manually or create an automation (Enterprise only) that is executed every time a row is updated.

And here is the code that should be self-explanatory:

# This python script searches for similar rows and deletes these rows
# if you want to compare multiple columns, then add a *formula column* and concatenate the different columns
# here is an example for such a formula : {ColumnA}&"_"&{ColumnB}&"_"&{ColumnC}

# INPUT VARIABLES (only change these three values!):
table_name = 'Table1'
view_name = 'Default View'
column_to_compare = 'the-name-of-the-column'

# DO NOT CHANGE ANYTHING BELOW THIS LINE !
from seatable_api import Base, context

server_url = context.server_url
api_token = context.api_token

def remove_duplicates():
    print("start to search for duplicate rows ...")

    # AUTH
    base = Base(api_token, server_url)
    base.auth()
    
    # TEMPORARY VARIABLE TO STORE ROW VALUES FOR COMPARISON
    previous_column = ""
    dfound = 0

    # keep the oldest entry:
    #rows = base.query('select _id,_ctime,'+column_to_compare +' from '+ table_name +' order by '+ column_to_compare +', _ctime')
    # keep the newest entry:
    rows = base.query('select _id,_ctime,'+column_to_compare +' from '+ table_name +' order by '+ column_to_compare +', _ctime DESC')

    for row in rows:
      # DEBUG:
      #print(row)
      #print(row.get('Title'))

      # COMPARE THIS ROW WITH THE PREVIOUS ONE:
      if row.get(column_to_compare) == previous_column:
        print("Duplicate found at row with the id: " + row.get('_id') + ". Delete this row ...")
        previous_column = row.get(column_to_compare)
        base.delete_row(table_name, row.get('_id'))
        dfound = dfound + 1
      else:
        previous_column = row.get(column_to_compare)

    print(str(dfound) +" duplicates found and deleted")

remove_duplicates()
3 Likes

This is very interesting as an automation. What would be even more useful for me is if it were possible to not only find duplicates and delete them but to compare old and new entries and update the old entries with the new information (specifically, in my case, updating by adding the new information to the old previously existing row and then deleting the new row).
The reason is that I have a table containing people who have registered for courses in the past. Via Integromat, when they register for a new course, a new row is added to this table. i then have to manually go through, find the duplicates and then add a link to the course in the first entry, then delete this new entry as it is no longer needed. The aim is to have one row per person with a column linking to all courses the have ever registered for.

This is an deduplication plugin:

Hi Daniel,

That is really interesting. I am on SeaTable cloud. I see that Deduplication is on Github. How can I add the deduplication plugin to SeaTable?

  • Open a base
  • Hit “Plugins” in the top right corner
  • Hit “Add plugin”
  • Choose the deduplication plugin

Oh, I see, this is the “duplicates” plugin. It was already installed.
The problem here is that I only have the option to delete duplicates, whereas I don’t want to lose information from the new rows but merge the different items of information from one or more columns into one row.
Shall I add this to the ideas / feature requests forum or is there already a python script or formula that can do this??

I doubt that this should be added to the existing deduplication plugin. Every user will have a slightly different requirement, and therefore an automation is the way to go…

Are you experienced enough to extend the python script by yourself? Otherwise, we could offer for you to extend the script to your needs as a service. Please make a concrete example with data, table structure and what should happen. Then we can talk in more details.

1 Like

Thanks Christoph,

I have some programming experience (mostly in Javascript) but not in Python. I have been looking around for different ways to achieve the task I describe (and I have posted on this a number of times) but will give some thought to the service idea.

In the meantime, thank you so much for all your work and also your help in the forum!

1 Like

Thanks for this thread and question.

Like the original post, I would like an automation to update a row.

I have a table of people’s contact details. If the same person fills in a form to join, that will create a new row.

Ideally, the automation would look up previous rows and identify a duplicate and add any details that were previously blank fields, and update any fields with new info provided. Once all the data was updated, the new row would be removed.

I have no programming ability (though I plan to learn in 2023). Does this automation exist already in some form?

Unfortunately, there is no merge automation at this point in time available in SeaTable.

To accomplish a merge, you’d have to write a custom script or extend the script above.

Hi, is it possible somehow to configure the Deduplication plugin to not distinct between low and uppercase?

In Python, you can transform a string to lowercases with .lower().

Therefore you have to do this to make the script case insensitive.

if row.get(column_to_compare).lower == previous_column.lower:

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.