Skip to content

Upsert of 2 table with a one to many relation onto them: fastest implementation (single vs batch vs one commit vs ?) #437

@tepelbaum

Description

@tepelbaum

First Check

  • I added a very descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the SQLModel documentation, with the integrated search.
  • I already searched in Google "How to X in SQLModel" and didn't find any information.
  • I already read and followed all the tutorial in the docs and didn't find an answer.
  • I already checked if it is not related to SQLModel but to Pydantic.
  • I already checked if it is not related to SQLModel but to SQLAlchemy.

Commit to Help

  • I commit to help with one of those options 👆

Example Code

class Lei(LeiBase, table=True):
    """
    lei db. Start from csv taken here
    https://www.gleif.org/fr/lei-data/lei-mapping/download-isin-to-lei-relationship-files/
    Could be completed by direct gleif api calls
    https://documenter.getpostman.com/view/7679680/SVYrrxuU?version=latest#3e013a79-b5f6-46a7-b9e9-299fde0b3a03
    """
    id: Optional[int] = Field(default=None, primary_key=True)
    created_at: datetime = Field(default_factory=datetime.utcnow)
    isins: List['Isin'] = Relationship(back_populates='lei')

class Isin(SQLModel, table=True):
    """
    isin db. Start from csv taken here
    https://www.gleif.org/fr/lei-data/lei-mapping/download-isin-to-lei-relationship-files/
    Could be completed by direct gleif api calls
    https://documenter.getpostman.com/view/7679680/SVYrrxuU?version=latest#3e013a79-b5f6-46a7-b9e9-299fde0b3a03
    """
    id: Optional[int] = Field(default=None, primary_key=True)
    created_at: datetime = Field(default_factory=datetime.utcnow)
    isin: str = Field(index=True)
    lei_id: Optional[int] = Field(default=None, foreign_key='lei.id')
    lei: Optional['Lei'] = Relationship(back_populates='isins')


def create_or_update_isins_leis(grouped_isins: Dict[str, List[str]], session: Session):
    """
    Code suggestion, to be improved. isin_lei_models is not provided but 
    """
    for lei, isins in grouped_isins.items():
        isin_lei_models(lei, isins, session)
    session.commit()

def isin_lei_models(lei: str, isins: List[str], session: Session):
    """
    Possible way to upsert lei and isins
    """
    in_db_lei = session.exec(select(Lei).where(col(Lei.lei) == lei)).first() or Lei(lei=lei)
    isins_to_insert = list(set(isins) - {isin.isin for isin in in_db_lei.isins})
    for isin in isins_to_insert:
        if in_db_isin := session.exec(select(Isin).where(col(Isin.isin) == isin)).first():
            in_db_isin.lei = lei
            session.add(in_db_isin)
            return
        session.add(Isin(isin=isin, lei=lei))

Description

I simplified the code as much as to focus on my interrogation. I have a simple one to many relation between Lei and Isin. Imagine that I am given new {lei: isins} values and I have to either insert or update the lei. My question is simple: should I commit once at the end or in the for loop just above ? or in the lei loop? Should I batch if I have 10M leis? Is the answer the same in Create and Upsert mode?

I found this related stackoverflow thread https://stackoverflow.com/questions/24377193/best-way-to-update-millions-of-records-in-sql-table-with-foreign-keys, which describes well what I am trying to accomplish, but the solution seems far fetched (buy maybe it is not?)

I can experiment all suggested ideas if needs be :)

Operating System

Linux

Operating System Details

python:3.10-slim docker image

Requirements.in (then converted to txt via pip-compile)

bcrypt==3.2.0
certifi==2021.5.30
cryptography==3.4.8
dash==2.6.0
dash-auth==1.4.1
dash_core_components==2.0.0
dash_bootstrap_components==1.2.0
fastapi==0.80.0
jupyter==1.0.0
jupyter-dash==0.4.2
nltk==3.7
numpy==1.23.1
openpyxl==3.0.10
pandas==1.4.3
passlib==1.7.4
pillow==9.2.0
plotly==5.9.0
psycopg2==2.9.1
pydeps==1.10.22
python-dateutil==2.8.2
python-dotenv==0.19.0
python-editor==1.0.4
python-jose==3.3.0
python-multipart==0.0.5
requests==2.26.0
sqladmin==0.3.0
sqlalchemy==1.4.35 # Needed, otherwise relationship are not working. See https://github.com/tiangolo/sqlmodel/issues/315
sqlmodel==0.0.6
strsimpy==0.2.1
tqdm==4.64.0
uvicorn==0.18.2

SQLModel Version

0.0.6

Python Version

3.10

Additional Context

In creation mode, a csv with 8 millions LEI/ISINS line takes less thant 30 minutes to insert. In upsert I gave up, it was too long with my naive implentation :)

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionFurther information is requested

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions