Natural Language to MongoDB Queries

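This page provides guidance on using large language models (LLMs) to generate MongoDB queries for your data from natural language.

For example, consider the following natural language query and the mongosh query generated from it for the Atlas sample_mflix database.

Given the following natural language query: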

Show me the genres and runtime of
10 movies from 2015 that have
the most comments

This generates the following mongosh code:

db.movies.aggregate([
  {
    $match: {
      year: 2015,
    },
  },
  {
    $sort: {
      num_mflix_comments: -1,
    },
  },
  {
    $limit: 10,
  },
  {
    $project: {
      _id: 0,
      genres: 1,
      runtime: 1,
    },
  },
]);

Available Methods

In addition to using an LLM directly, you can use tools built by MongoDB to generate MongoDB queries from natural language.

Model Selection

Models that perform well on general tasks typically also perform well at MongoDB query generation. When you select an LLM to generate MongoDB queries, refer to popular general-purpose benchmarks such as MMLU-Pro to compare performance between models.

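Effective Prompting

This section describes effective strategies for prompting an LLM to generate MongoDB queries.

Note

The following prompting strategies are based on benchmarks created by MongoDB. To learn more, see the public natural language to mongosh code benchmark on Hugging Face.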

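Base Prompt

Your base prompt, also called the system prompt, should give a clear overview of your task, including:

The type of query to generate.
Information about the expected output structure, such as the driver language or the tool that executes the query.

The following example base prompt shows how to ask for a MongoDB read operation or aggregation for mongosh: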

You are an expert data analyst experienced at using MongoDB.
Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query.
Format the mongosh query in the following structure:
`db.<collection name>.find({/* query */})` or `db.<collection name>.aggregate({/* query */})`

General Guidance

To improve query quality, add the following guidance to your base prompt. It gives the model general tips for generating effective MongoDB queries:

Some general query-authoring tips:
1. Ensure proper use of MongoDB operators ($eq, $gt, $lt, etc.) and data types (ObjectId, ISODate)
2. For complex queries, use aggregation pipeline with proper stages ($match, $group, $lookup, etc.)
3. Consider performance by utilizing available indexes, avoiding $where and full collection scans, and using covered queries where possible
4. Include sorting (.sort()) and limiting (.limit()), when appropriate, for result set management
5. Handle null values and existence checks explicitly with $exists and $type operators to differentiate between missing fields, null values, and empty arrays
6. Do not include `null` in results objects in aggregation, e.g. do not include _id: null
7. For date operations, NEVER use an empty new date object (e.g. `new Date()`). ALWAYS specify the date, such as `new Date("2024-10-24")`. 
8. For Decimal128 operations, prefer range queries over exact equality
9. When querying arrays, use appropriate operators like $elemMatch for complex matching, $all to match multiple elements, or $size for array length checks
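One way to keep the base prompt and the tips maintainable is to store the tips as a list and number them when the system prompt is assembled. The following is a minimal sketch under that assumption; `buildSystemPrompt`, `BASE_PROMPT`, and `QUERY_TIPS` are hypothetical names, and only two of the nine tips are included for brevity.

```typescript
// Task overview, copied from the base prompt shown above.
const BASE_PROMPT: string = [
  "You are an expert data analyst experienced at using MongoDB.",
  "Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query.",
].join("\n");

// Two of the tips from the list above; add the rest as needed.
const QUERY_TIPS: string[] = [
  "Ensure proper use of MongoDB operators ($eq, $gt, $lt, etc.) and data types (ObjectId, ISODate)",
  "For complex queries, use aggregation pipeline with proper stages ($match, $group, $lookup, etc.)",
];

// Hypothetical helper: numbers the tips and appends them to the base prompt.
function buildSystemPrompt(base: string, tips: string[]): string {
  const numbered = tips.map((tip, i) => (i + 1) + ". " + tip);
  return [base, "Some general query-authoring tips:"].concat(numbered).join("\n");
}

const systemPrompt: string = buildSystemPrompt(BASE_PROMPT, QUERY_TIPS);
console.log(systemPrompt);
```

Keeping the tips in an array makes it easy to add, remove, or reorder guidance while you evaluate which tips help for your data.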

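Chain of Thought

To improve response quality, you can ask the model to reason through the task before it generates the response. This technique, called chain-of-thought prompting, improves performance but increases generation time and cost.

To have the model think step by step before it generates the query, add the following text to your base prompt: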

Think step by step about the code in the answer before providing it. In your thoughts, consider:
1. Which collections are relevant to the query.
2. Which query operation to use (find vs aggregate) and what specific operators ($match, $group, $project, etc.) are needed.
3. What fields are relevant to the query.
4. Which indexes you can use to improve performance.
5. What specific transformations or projections are required.
6. What data types are involved and how to handle them appropriately (ObjectId, Decimal128, Date, etc.).
7. What edge cases to consider (empty results, null values, missing fields).
8. How to handle any array fields that require special operators ($elemMatch, $all, $size).
9. Any other relevant considerations.
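A chain-of-thought response contains reasoning text before the query, so the calling code has to separate the two. The following is a minimal sketch, assuming you also instruct the model to put the final query in a fenced code block. That instruction is an assumption and is not part of the prompt text above, and `extractFinalQuery` is a hypothetical helper.

```typescript
// Built from single characters so this example can sit inside a fenced block.
const FENCE: string = ["`", "`", "`"].join("");

// Hypothetical helper: returns the body of the last complete fenced block in
// the response, or null if there is none. Assumes the opening fence line holds
// only an optional language tag.
function extractFinalQuery(response: string): string | null {
  // After splitting on the fence, text inside fences sits at odd indexes.
  const parts = response.split(FENCE);
  if (parts.length < 3) {
    return null; // fewer than two fences: no complete block
  }
  let idx = parts.length - 2;
  if (idx % 2 === 0) {
    idx -= 1; // skip a trailing unclosed block
  }
  const block: string = parts[idx] || "";
  // Drop the first line, which holds the optional language tag.
  const newline = block.indexOf("\n");
  const body = newline === -1 ? block : block.slice(newline + 1);
  return body.trim();
}

const exampleResponse: string = [
  "The movies collection is relevant, and a find with a filter on year is enough.",
  FENCE + "javascript",
  "db.movies.find({ year: 2015 }).limit(10)",
  FENCE,
].join("\n");

console.log(extractFinalQuery(exampleResponse));
```

Taking the last complete block rather than the first is a design choice: models sometimes include draft snippets in their reasoning before the final answer.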

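Include Sample Documents

To significantly improve query quality, include a few sample documents from your collection. Typically, two to three representative documents give the model sufficient context about the data structure.

When you provide sample documents, follow these guidelines:

Use the BSON.EJSON.serialize() function to convert BSON documents to Extended JSON (EJSON) for the prompt.
Truncate long fields and deeply nested objects.
Exclude long string values.
For large arrays, such as vector embeddings, include only a few elements.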
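The truncation guidelines above can be sketched as a small recursive function. This is an illustration and not an official utility: `truncateForPrompt`, `TruncateOptions`, and the marker strings are hypothetical, and the function assumes its input is already plain Extended JSON data (for example, the result of EJSON serialization) rather than raw BSON objects.

```typescript
// Plain JSON value, as produced after converting BSON to Extended JSON.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

interface TruncateOptions {
  maxStringLength: number; // longer strings are cut and suffixed with "..."
  maxArrayItems: number;   // longer arrays keep only the first items
  maxDepth: number;        // objects at this nesting depth are replaced
}

// Hypothetical helper: shortens strings, arrays, and deep objects.
// Depth counts both object and array nesting.
function truncateForPrompt(value: Json, opts: TruncateOptions, depth: number = 0): Json {
  if (typeof value === "string") {
    return value.length > opts.maxStringLength
      ? value.slice(0, opts.maxStringLength) + "..."
      : value;
  }
  if (Array.isArray(value)) {
    const kept: Json[] = value
      .slice(0, opts.maxArrayItems)
      .map((item) => truncateForPrompt(item, opts, depth + 1));
    const dropped = value.length - kept.length;
    if (dropped > 0) {
      kept.push("...and " + dropped + " more items");
    }
    return kept;
  }
  if (value !== null && typeof value === "object") {
    if (depth >= opts.maxDepth) {
      return "[truncated object]";
    }
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value)) {
      out[key] = truncateForPrompt(value[key], opts, depth + 1);
    }
    return out;
  }
  return value; // numbers, booleans, and null pass through unchanged
}

const truncatedSample = truncateForPrompt(
  { plot: "abcdefghij", cast: ["a", "b", "c", "d"], n: { m: { k: 1 } } },
  { maxStringLength: 5, maxArrayItems: 3, maxDepth: 2 }
);
console.log(JSON.stringify(truncatedSample, null, 2));
```

Leaving a visible marker where content was cut tells the model that the real field is longer than what it sees in the prompt.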

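Example Sample Documents

Sample documents from the movies collection to include in a prompt

For example, for the movies collection in the sample_mflix database, you can include the following documents in your prompt: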

[
  {
    _id: {
      $oid: "573a13bbf29313caabd526d0",
    },
    plot: "Van Erp shows us what the Dutch do in their spare time and takes a look at the industry behind all t...",
    genres: ["Documentary"],
    runtime: 90,
    title: "Pretpark Nederland",
    num_mflix_comments: 0,
    poster:
      "http://m.media-amazon.com/images/M/MV5BMTUwNjU0ODg3N15BMl5BanBnXkFtZTcwMzg3NjYxNA@@._V1_SY1000_SX67...",
    countries: ["Netherlands"],
    fullplot:
      "Van Erp displays the mechanics behind the Dutch tourism industry. Key figures behind events and dest...",
    languages: ["Dutch", "Mandarin"],
    released: {
      $date: "2006-10-18T00:00:00.000Z",
    },
    directors: ["Michiel van Erp"],
    writers: ["Renè van 't Erve (scenario)", "Michiel van Erp (scenario)"],
    awards: {
      wins: 0,
      nominations: 1,
      text: "1 nomination.",
    },
    lastupdated: "2015-02-26T00:48:24.883Z",
    year: 2006,
    imdb: {
      rating: 7.3,
      votes: 237,
      id: 882800,
    },
    type: "movie",
    tomatoes: {
      viewer: {
        rating: 2.2,
        numReviews: 19,
      },
      dvd: {
        $date: "2010-06-22T00:00:00.000Z",
      },
      lastUpdated: {
        $date: "2014-11-24T14:15:50.000Z",
      },
    },
    hash: {
      low: -1866172407,
      high: -2147460187,
      unsigned: false,
    },
  },
  {
    _id: {
      $oid: "573a13caf29313caabd7c4e0",
    },
    fullplot:
      "A drama centered on a rising country-music songwriter (Hedlund) who sparks with a fallen star (Paltr...",
    imdb: {
      rating: 6.3,
      votes: 14066,
      id: 1555064,
    },
    year: 2010,
    plot: "A rising country-music songwriter works with a fallen star to work their way fame, causing romantic ...",
    genres: ["Drama", "Music"],
    rated: "PG-13",
    metacritic: 45,
    title: "Country Strong",
    lastupdated: "2015-09-03T00:39:54.710Z",
    languages: ["English"],
    writers: ["Shana Feste"],
    type: "movie",
    tomatoes: {
      website: "http://www.countrystrong-movie.com/?hs308=CST6186",
      viewer: {
        rating: 3.3,
        numReviews: 32825,
        meter: 53,
      },
      dvd: {
        $date: "2011-04-12T00:00:00.000Z",
      },
      critic: {
        rating: 4.5,
        numReviews: 130,
        meter: 22,
      },
      boxOffice: "$20.2M",
      consensus:
        "The cast gives it their all, and Paltrow handles her songs with aplomb, but Country Strong's cliched...",
      rotten: 101,
      production: "Screen Gems",
      lastUpdated: {
        $date: "2015-08-17T18:04:40.000Z",
      },
      fresh: 29,
    },
    poster:
      "http://m.media-amazon.com/images/M/MV5BMTUxMjQ0NjE3OV5BMl5BanBnXkFtZTcwODIxNDEwNA@@._V1_SY1000_SX67...",
    num_mflix_comments: 0,
    released: {
      $date: "2011-01-07T00:00:00.000Z",
    },
    awards: {
      wins: 2,
      nominations: 6,
      text: "Nominated for 1 Oscar. Another 1 win & 6 nominations.",
    },
    countries: ["USA"],
    cast: [
      "Gwyneth Paltrow",
      "Tim McGraw",
      "Garrett Hedlund",
      "...and 1 more items",
    ],
  },
];

Best Practices

When you generate MongoDB queries from natural language, apply the following prompting best practices where they fit your use case.

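Include Index Information

To help the LLM generate more performant queries, include the collection's indexes in your prompt. MongoDB drivers and mongosh provide methods to retrieve index information. For example, the Node.js driver provides the listIndexes() method, which returns the indexes to include in your prompt.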
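Once you have the index descriptions, you need to turn them into prompt text. The following is a minimal sketch, assuming each description has at least the `name` and `key` fields that index listings report. `describeIndexes` is a hypothetical helper, and the compound index shown is made up for illustration and is not claimed to exist on the sample dataset.

```typescript
// Subset of the fields found in an index description.
interface IndexDescription {
  name: string;
  key: { [field: string]: number | string };
  unique?: boolean;
}

// Hypothetical helper: renders one line per index for the prompt.
function describeIndexes(indexes: IndexDescription[]): string {
  return indexes
    .map((index) => {
      const fields = Object.keys(index.key)
        .map((field) => field + ": " + index.key[field])
        .join(", ");
      const flags = index.unique ? " (unique)" : "";
      return "- " + index.name + ": { " + fields + " }" + flags;
    })
    .join("\n");
}

// Illustrative input; the second index is a made-up example.
const exampleIndexes: IndexDescription[] = [
  { name: "_id_", key: { _id: 1 } },
  { name: "year_1_num_mflix_comments_-1", key: { year: 1, num_mflix_comments: -1 } },
];

console.log(describeIndexes(exampleIndexes));
```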

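Time-Based Queries

Most LLM-powered tools include the date in their system prompt. However, if you use an LLM out of the box, the model does not know the current date or time. Therefore, when you work with a base model or build your own natural language to MongoDB tool, include the latest date in your prompt. Use your programming language's methods to get the current date as a string, such as new Date().toString() in JavaScript or str(datetime.now()) in Python.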
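As a small illustration, the date can be formatted once and placed on its own line in the prompt. This sketch uses `toISOString()` instead of `toString()` so that the format is fixed and independent of locale and time zone. That is a substitution made here for illustration, and `latestDateLine` is a hypothetical helper.

```typescript
// Hypothetical helper: formats a date as the "Latest Date" line of a prompt.
// Uses the UTC calendar date in ISO 8601 form (YYYY-MM-DD).
function latestDateLine(now: Date): string {
  return "Latest Date: " + now.toISOString().slice(0, 10);
}

// A fixed date keeps the example reproducible; pass new Date() in real use.
console.log(latestDateLine(new Date("2024-10-24T12:00:00Z")));
// prints "Latest Date: 2024-10-24"
```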

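Annotated Database Schema

Include annotated schemas of the relevant database collections in your prompt. No single representation works best for every LLM, but some approaches are more effective than others.

We recommend representing collections with programming-language-native types that describe the shape of the data, such as TypeScript types, Python Pydantic models, or Go structs. If you use MongoDB from one of these languages, you likely already have the data shape defined. To guide the LLM and reduce ambiguity, add comments to the schema that describe each field.

The following example shows a TypeScript type for the sample_mflix.movies collection:

Example TypeScript Schema

Example annotated TypeScript schema for the sample_mflix.movies collection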

interface Movie {
  /**
   * Unique identifier for the movie document.
   */
  _id: ObjectId;
  /**
   * Brief description of the movie's plot.
   */
  plot: string;
  /**
   * List of genres associated with the movie.
   */
  genres: string[];
  /**
   * Duration of the movie in minutes.
   */
  runtime: number;
  /**
   * Title of the movie.
   */
  title: string;
  /**
   * Number of comments on the movie in the mflix system.
   */
  num_mflix_comments: number;
  /**
   * URL to the movie's poster image.
   */
  poster: string;
  /**
   * List of countries where the movie was produced.
   */
  countries: string[];
  /**
   * Detailed description of the movie's plot.
   */
  fullplot: string;
  /**
   * Languages spoken in the movie.
   */
  languages: string[];
  /**
   * Release date of the movie.
   */
  released: Date;
  /**
   * List of directors of the movie.
   */
  directors: string[];
  /**
   * List of writers of the movie.
   */
  writers: string[];
  /**
   * Awards received by the movie.
   */
  awards: {
    /**
     * Number of awards won by the movie.
     */
    wins: number;
    /**
     * Number of award nominations received by the movie.
     */
    nominations: number;
    /**
     * Textual description of the awards.
     */
    text: string;
  };
  /**
   * Last updated timestamp for the movie document.
   */
  lastupdated: string;
  /**
   * Year the movie was released.
   */
  year: number;
  /**
   * IMDb information for the movie.
   */
  imdb: {
    /**
     * IMDb rating of the movie.
     */
    rating: number;
    /**
     * Number of votes the movie received on IMDb.
     */
    votes: number;
    /**
     * IMDb identifier for the movie.
     */
    id: number;
  };
  /**
   * Type of the movie (e.g., movie, series).
   */
  type: string;
  /**
   * Rotten Tomatoes information for the movie.
   */
  tomatoes: {
    /**
     * Viewer ratings on Rotten Tomatoes.
     */
    viewer?: {
      /**
       * Viewer rating score.
       */
      rating: number;
      /**
       * Number of reviews by viewers.
       */
      numReviews: number;
      /**
       * Viewer meter score.
       */
      meter: number;
    };
    /**
     * DVD release date.
     */
    dvd?: Date;
    /**
     * Last updated timestamp for Rotten Tomatoes data.
     */
    lastUpdated?: Date;
    /**
     * Official website for the movie.
     */
    website?: string;
    /**
     * Critic ratings on Rotten Tomatoes.
     */
    critic?: {
      /**
       * Critic rating score.
       */
      rating: number;
      /**
       * Number of reviews by critics.
       */
      numReviews: number;
      /**
       * Critic meter score.
       */
      meter: number;
    };
    /**
     * Box office earnings.
     */
    boxOffice?: string;
    /**
     * Consensus statement from Rotten Tomatoes.
     */
    consensus?: string;
    /**
     * Number of rotten reviews.
     */
    rotten?: number;
    /**
     * Production company.
     */
    production?: string;
    /**
     * Number of fresh reviews.
     */
    fresh?: number;
  };
  /**
   * Hash value for the movie document.
   */
  hash: Long;
  /**
   * MPAA rating of the movie.
   */
  rated?: string;
  /**
   * Metacritic score of the movie.
   */
  metacritic?: number;
  /**
   * List of main cast members in the movie.
   */
  cast: string[];
}

Prompt Template

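The following example shows a complete prompt that uses the strategies described on this page to generate mongosh code from natural language.

Example Base Prompt

Use the following example system prompt as a template for your MongoDB query generation tasks. The sample prompt includes the following components:

Task overview and expected output format
General MongoDB query-authoring guidance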

You are an expert data analyst experienced at using MongoDB.
Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query.
Format the mongosh query in the following structure:
`db.<collection name>.find({/* query */})` or `db.<collection name>.aggregate({/* query */})`
Some general query-authoring tips:
1. Ensure proper use of MongoDB operators ($eq, $gt, $lt, etc.) and data types (ObjectId, ISODate).
2. For complex queries, use aggregation pipeline with proper stages ($match, $group, $lookup, etc.).
3. Consider performance by utilizing available indexes, avoiding $where and full collection scans, and using covered queries where possible.
4. Include sorting (.sort()) and limiting (.limit()) when appropriate for result set management.
5. Handle null values and existence checks explicitly with $exists and $type operators to differentiate between missing fields, null values, and empty arrays.
6. Do not include `null` in results objects in aggregation, e.g. do not include _id: null.
7. For date operations, NEVER use an empty new date object (e.g. `new Date()`). ALWAYS specify the date, such as `new Date("2024-10-24")`. Use the provided 'Latest Date' field to inform dates in queries.
8. For Decimal128 operations, prefer range queries over exact equality.
9. When querying arrays, use appropriate operators like $elemMatch for complex matching, $all to match multiple elements, or $size for array length checks.

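Note

You can also prompt the model to think step by step before it generates code.

User Message Template

Then, use the following user message template to give the model the necessary context about your database and the query you want: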

Generate MongoDB Shell (mongosh) queries for the following database and natural language query:
## Database Information
Name: {{Database name}}
Description: {{database description}}
Latest Date: {{latest date}} (use this to inform dates in queries)
### Collections
#### Collection `{{collection name. Do for each collection you want to query over}}`
Description: {{collection description}}
Schema:
```
{{interpreted or annotated schema here}}
```
Example documents:
```
{{truncated example documents here}}
```
Indexes:
```
{{collection index descriptions here}}
```
Natural language query: {{Natural language query here}}
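Filling the double-brace placeholders can be done with a small substitution function. The following is a minimal sketch using a shortened version of the template. `fillTemplate` is a hypothetical helper, and the short placeholder names are chosen for illustration.

```typescript
// Hypothetical helper: replaces each {{placeholder}} with its value and
// throws if a placeholder has no value, so gaps do not reach the model.
function fillTemplate(template: string, values: { [key: string]: string }): string {
  return template.replace(/\{\{\s*([^{}]+?)\s*\}\}/g, (_match: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(values, key)) {
      throw new Error("Missing template value: " + key);
    }
    return values[key];
  });
}

// Shortened template with illustrative placeholder names.
const shortTemplate: string = [
  "## Database Information",
  "Name: {{Database name}}",
  "Latest Date: {{latest date}} (use this to inform dates in queries)",
  "Natural language query: {{query}}",
].join("\n");

const userMessage: string = fillTemplate(shortTemplate, {
  "Database name": "sample_mflix",
  "latest date": "2024-10-24",
  query: "Show me the genres and runtime of 10 movies from 2015 that have the most comments",
});

console.log(userMessage);
```

The replacement is computed by a function rather than passed as a string because JavaScript treats sequences such as `$&` in replacement strings specially, and MongoDB schemas and example documents often contain `$` characters.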
