{"id":18391743,"url":"https://github.com/moscicky/qdrl","last_synced_at":"2026-02-23T03:40:25.704Z","repository":{"id":83322206,"uuid":"466451405","full_name":"moscicky/qdrl","owner":"moscicky","description":"query-document representation learning","archived":false,"fork":false,"pushed_at":"2022-11-26T09:39:49.000Z","size":7783,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-04T14:49:47.046Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/moscicky.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-05T12:48:47.000Z","updated_at":"2023-03-30T21:19:57.000Z","dependencies_parsed_at":null,"dependency_job_id":"7585cbff-5318-49f4-b75e-29f271e989e0","html_url":"https://github.com/moscicky/qdrl","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moscicky%2Fqdrl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moscicky%2Fqdrl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moscicky%2Fqdrl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moscicky%2Fqdrl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/moscicky","download_url":"https://codeload.github.com/moscicky/qdrl/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://githu
b.com","kind":"github","repositories_count":247589394,"owners_count":20963018,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T01:53:08.233Z","updated_at":"2026-02-23T03:40:20.677Z","avatar_url":"https://github.com/moscicky.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# qdrl\n\nThis repository contains the training code used for my master's thesis, titled \n`Joint Multi-Modal Query-Document Representation Learning`, which you can read [here](thesis.pdf).\n\n## Training\nModel training is configured by a `config.yaml` file containing the training parameters.\n\nTraining runs on Google Cloud Platform using Vertex AI Training with a custom image. Configs, datasets and models\nare stored on Google Cloud Storage, so gcsfuse is required.\n\nThe training flow is as follows:\n1. Create a local training config - `local_config.yaml`\n2. Build the training Docker image using the [dedicated script](run_train_local.sh). Point the \nscript to `local_config.yaml` and to your GCP project. The script will output your `${image_name}`.\n3. Publish the Docker image to GCR: `docker push ${image_name}`\n4. Upload the training config to GCS.\n5. Run the [GCP training script](run_train_gcp_gpu.sh) with the correct `CONTAINER_IMAGE_URI` and the GCS path to `config.yaml`.\n\nA model checkpoint is saved after each epoch; the model from the last epoch is saved separately.\n\nThe evaluation metrics, `recall@k` and `mrr@k`, are saved and can be visualized in TensorBoard. 
\n\nEmbedding visualization can optionally be turned on if you want to explore it in the TensorBoard projector.\n\n## Datasets\nTraining and evaluation require 3 datasets (details are in the thesis).\n\n1. Training dataset - pairs of query, relevant document\n2. Evaluation queries dataset (`recall_validation_queries_dataset`) - pairs of query, relevant document id\n3. Evaluation documents dataset (`recall_validation_items_dataset`) - candidate pool for evaluation\n\n## Config\n[Example config](configs/example.yaml)\n\nSupported training parameters:\n- task_id\n- run_id\n- num_epochs\n- dataset_dir\n- batch_size\n- learning_rate\n- reuse_epoch\n- dataloader_workers\n- dataset - structure of training features\n- loss - can be `batch_softmax` or `triplet`\n- text_vectorizer - path to the token dictionary and tokenization config (word_unigram, word_bigram, char_trigram + oov)\n- model - can be `SimpleTextEncoder`, `TwoTower`, or `MultiModalTwoTower`\n- recall_validation - the values of `k` for which validation is run and whether to generate a dataset with typos\n\n## Acknowledgments\n\nResearch papers can be found in the thesis. For the code, special thanks go to:\n\n- https://github.com/adambielski/siamese-triplet\n- https://gist.github.com/danmelton/183313\n- https://stackoverflow.com/a/58144658/7073537\n\n## FAQ\n\n#### 1. The codebase is awful and does not have tests, why?\n\nBest engineering practices do not apply to a master's thesis, sorry.\n\n#### 2. What does 'qdrl' mean?\n\nqdrl stands for **Q**uery **D**ocument **R**epresentation **L**earning\n\n#### 3. No distributed training?\n\nNo.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoscicky%2Fqdrl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmoscicky%2Fqdrl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoscicky%2Fqdrl/lists"}