Data Engineer 2.0. Part II: Retrieval Augmented Generation

…language=Language.MARKDOWN, chunk_size=100, chunk_overlap=10, length_function=num_tokens_from_string)
vectorstore = get_vectorstore(collection_name="big_fragments")
store = InMemoryStore()
retriever = ParentDocumentRetriever(vectorstore=vectorstore, docstore=store, child_splitter=child_splitter, parent_splitter=parent_splitter)
retriever.add_documents(docs)
Self-retriever
A self-querying retriever, as the name implies, possesses the capability to generate…

Managing Kubernetes secrets like a Pro

…this SecretStore object, you can try to describe the object:
kubectl describe secretstore <your-secretstore-name>
You should see the status of the SecretStore as Valid:
Status:
  Conditions:
    Last Transition Time: xxxx…

Lakehouse and Warehouse: sharing data between environments

…platform choice, we use AWS Redshift. Metastore: in our case we are using Hive Metastore (we could also have used Glue).
Current situation (environment isolation)
Our production environment must be…

Practical tricks and tips to reduce AWS EMR costs

…we found an opportunity to improve cost efficiency by switching to a more performant disk type. The instance store is an AWS offering that maximises disk performance…

From Hello World to a Dispute Management back office

…and pretty much everything in between)
All this data will need to be normalised into our own data model and stored
An API will be needed to expose it
For…

Enforcing and controlling Infrastructure as Code

…mostly depends on how many events you process. To get an estimate of the size of events stored, go to your CloudWatch log group and check for Stored Bytes and…

Machine Learning Engineer (F/M)

…production at scale. Why join us? Use a cutting-edge tech environment: MLFlow, Kubeflow, JupyterHub, Spark, AWS (S3, Redshift, Athena), FastAPI, Kubernetes, along with a feature store and a…

Architecting Compliance: Cost-Effective Data Strategies for GDPR

…Kafka, it is stored in Delta Lake format, but with a clear distinction: personally identifiable information (PII) fields are identified and separated as defined in the data contracts for…
