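Natural Language to MongoDB Queries

This page provides guidance on using large language models (LLMs) to generate MongoDB queries over your data from natural language.

For example, consider the following natural language query and the mongosh query generated from it for the Atlas sample_mflix database.

Given the following natural language query: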

Show me the genres and runtime of
10 movies from 2015 that have
the most comments

This generates the following mongosh code:

db.movies.aggregate([
  {
    $match: {
      year: 2015,
    },
  },
  {
    $sort: {
      num_mflix_comments: -1,
    },
  },
  {
    $limit: 10,
  },
  {
    $project: {
      _id: 0,
      genres: 1,
      runtime: 1,
    },
  },
]);

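Available Methods

In addition to using an LLM directly, you can use tools built by MongoDB to generate MongoDB queries from natural language.

Model Selection

Models that perform well on general tasks typically also perform well at MongoDB query generation. When selecting an LLM to generate MongoDB queries, refer to popular benchmarks such as MMLU-Pro and Chatbot Arena ELO to compare performance across models.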

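Effective Prompting

This section outlines effective strategies for prompting an LLM to generate MongoDB queries.

Note

The following prompting strategies are based on benchmarks created by MongoDB. To learn more, see the public benchmark for natural language to mongosh code on Hugging Face.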

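Base Prompt

The base prompt, also called the system prompt, should provide a clear overview of the task, including:

The type of query to generate.
Information about the expected output structure, such as the driver language or tool that executes the query.

The following example base prompt shows how to generate a MongoDB read operation or aggregation for mongosh: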

You are an expert data analyst experienced at using MongoDB.
Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query.
Format the mongosh query in the following structure:
`db.<collection name>.find({/* query */})` or `db.<collection name>.aggregate({/* query */})`

General Guidance

To improve query quality, add the following guidance to your base prompt to give the model general tips for generating effective MongoDB queries:

Some general query-authoring tips:
1. Ensure proper use of MongoDB operators ($eq, $gt, $lt, etc.) and data types (ObjectId, ISODate)
2. For complex queries, use aggregation pipeline with proper stages ($match, $group, $lookup, etc.)
3. Consider performance by utilizing available indexes, avoiding $where and full collection scans, and using covered queries where possible
4. Include sorting (.sort()) and limiting (.limit()), when appropriate, for result set management
5. Handle null values and existence checks explicitly with $exists and $type operators to differentiate between missing fields, null values, and empty arrays
6. Do not include `null` in results objects in aggregation, e.g. do not include _id: null
7. For date operations, NEVER use an empty new date object (e.g. `new Date()`). ALWAYS specify the date, such as `new Date("2024-10-24")`. 
8. For Decimal128 operations, prefer range queries over exact equality
9. When querying arrays, use appropriate operators like $elemMatch for complex matching, $all to match multiple elements, or $size for array length checks

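Chain of Thought

To improve response quality, you can prompt the model to "think out loud" before it generates a response. This technique, called chain-of-thought prompting, improves performance but increases generation time and cost.

To encourage the model to think step by step before generating the query, add the following text to your base prompt: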

Think step by step about the code in the answer before providing it. In your thoughts, consider:
1. Which collections are relevant to the query.
2. Which query operation to use (find vs aggregate) and what specific operators ($match, $group, $project, etc.) are needed.
3. What fields are relevant to the query.
4. Which indexes you can use to improve performance.
5. What specific transformations or projections are required.
6. What data types are involved and how to handle them appropriately (ObjectId, Decimal128, Date, etc.).
7. What edge cases to consider (empty results, null values, missing fields).
8. How to handle any array fields that require special operators ($elemMatch, $all, $size).
9. Any other relevant considerations.
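
The base prompt, the general guidance, and the chain-of-thought text are separate sections that you append to one another. The following is a minimal sketch of one way to combine them, not a prescribed implementation. The buildSystemPrompt helper and its parameter names are hypothetical, and the section strings are shortened stand-ins for the full texts shown on this page.

```javascript
// Hypothetical helper: joins the prompt sections described on this page
// into a single system prompt. Empty or omitted sections are skipped.
function buildSystemPrompt({ basePrompt, generalGuidance = "", chainOfThought = "" }) {
  if (typeof basePrompt !== "string" || basePrompt.trim() === "") {
    throw new Error("basePrompt is required");
  }
  return [basePrompt, generalGuidance, chainOfThought]
    .map((section) => section.trim())
    .filter((section) => section.length > 0)
    .join("\n\n"); // a blank line between sections keeps them visually distinct
}

// Shortened stand-ins; use the full texts from this page in practice.
const systemPrompt = buildSystemPrompt({
  basePrompt: "You are an expert data analyst experienced at using MongoDB.",
  generalGuidance: "Some general query-authoring tips:\n1. Ensure proper use of MongoDB operators.",
  chainOfThought: "Think step by step about the code in the answer before providing it.",
});
console.log(systemPrompt);
```

Keeping the sections separate makes it easy to turn chain-of-thought prompting on or off when you compare quality against generation time and cost.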

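Include Sample Documents

To significantly improve query quality, include a few representative sample documents from the collection. Two to three representative documents typically give the model sufficient context about the structure of the data.

When providing sample documents, follow these guidelines:

Use the BSON.EJSON.serialize() function to convert BSON documents to Extended JSON (EJSON) for the prompt.
Truncate long fields and deeply nested objects.
Exclude long string values.
For large arrays, such as vector embeddings, include only a few elements.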
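
The truncation guidelines above can be sketched as a small recursive helper. This is an illustration under stated assumptions rather than the method used to produce the samples on this page: the truncateForPrompt name and the three limits are hypothetical, and the input is assumed to be a plain JSON-like object, for example a document that has already been converted to EJSON.

```javascript
// Hypothetical limits; tune them for your data and the model's context window.
const MAX_STRING_LENGTH = 100;
const MAX_ARRAY_ITEMS = 3;
const MAX_DEPTH = 3;

// Returns a shortened copy of a plain JSON-like value for use in a prompt.
function truncateForPrompt(value, depth = 0) {
  if (typeof value === "string") {
    // Cut long strings and mark the cut with an ellipsis.
    return value.length > MAX_STRING_LENGTH
      ? value.slice(0, MAX_STRING_LENGTH) + "..."
      : value;
  }
  if (value instanceof Date) {
    return value.toISOString(); // guard in case a raw Date slips through
  }
  if (Array.isArray(value)) {
    // Keep the first few elements and note how many were dropped.
    const kept = value
      .slice(0, MAX_ARRAY_ITEMS)
      .map((item) => truncateForPrompt(item, depth + 1));
    const omitted = value.length - MAX_ARRAY_ITEMS;
    if (omitted > 0) kept.push(`...and ${omitted} more items`);
    return kept;
  }
  if (value !== null && typeof value === "object") {
    // Replace objects nested deeper than MAX_DEPTH with a marker.
    if (depth >= MAX_DEPTH) return "[truncated object]";
    return Object.fromEntries(
      Object.entries(value).map(([key, child]) => [key, truncateForPrompt(child, depth + 1)])
    );
  }
  return value; // numbers, booleans, null
}

const sample = { title: "Country Strong", cast: ["A", "B", "C", "D"], fullplot: "x".repeat(150) };
console.log(JSON.stringify(truncateForPrompt(sample), null, 2));
```

Marking each cut explicitly tells the model that the real data is longer than what the prompt shows.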

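Sample Document Example

Sample documents from the movies collection to include in a prompt

For example, for the sample_mflix database and the movies collection, you might include the following documents in your prompt: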

[
  {
    _id: {
      $oid: "573a13bbf29313caabd526d0",
    },
    plot: "Van Erp shows us what the Dutch do in their spare time and takes a look at the industry behind all t...",
    genres: ["Documentary"],
    runtime: 90,
    title: "Pretpark Nederland",
    num_mflix_comments: 0,
    poster:
      "http://m.media-amazon.com/images/M/MV5BMTUwNjU0ODg3N15BMl5BanBnXkFtZTcwMzg3NjYxNA@@._V1_SY1000_SX67...",
    countries: ["Netherlands"],
    fullplot:
      "Van Erp displays the mechanics behind the Dutch tourism industry. Key figures behind events and dest...",
    languages: ["Dutch", "Mandarin"],
    released: {
      $date: "2006-10-18T00:00:00.000Z",
    },
    directors: ["Michiel van Erp"],
    writers: ["Renè van 't Erve (scenario)", "Michiel van Erp (scenario)"],
    awards: {
      wins: 0,
      nominations: 1,
      text: "1 nomination.",
    },
    lastupdated: "2015-02-26T00:48:24.883Z",
    year: 2006,
    imdb: {
      rating: 7.3,
      votes: 237,
      id: 882800,
    },
    type: "movie",
    tomatoes: {
      viewer: {
        rating: 2.2,
        numReviews: 19,
      },
      dvd: {
        $date: "2010-06-22T00:00:00.000Z",
      },
      lastUpdated: {
        $date: "2014-11-24T14:15:50.000Z",
      },
    },
    hash: {
      low: -1866172407,
      high: -2147460187,
      unsigned: false,
    },
  },
  {
    _id: {
      $oid: "573a13caf29313caabd7c4e0",
    },
    fullplot:
      "A drama centered on a rising country-music songwriter (Hedlund) who sparks with a fallen star (Paltr...",
    imdb: {
      rating: 6.3,
      votes: 14066,
      id: 1555064,
    },
    year: 2010,
    plot: "A rising country-music songwriter works with a fallen star to work their way fame, causing romantic ...",
    genres: ["Drama", "Music"],
    rated: "PG-13",
    metacritic: 45,
    title: "Country Strong",
    lastupdated: "2015-09-03T00:39:54.710Z",
    languages: ["English"],
    writers: ["Shana Feste"],
    type: "movie",
    tomatoes: {
      website: "http://www.countrystrong-movie.com/?hs308=CST6186",
      viewer: {
        rating: 3.3,
        numReviews: 32825,
        meter: 53,
      },
      dvd: {
        $date: "2011-04-12T00:00:00.000Z",
      },
      critic: {
        rating: 4.5,
        numReviews: 130,
        meter: 22,
      },
      boxOffice: "$20.2M",
      consensus:
        "The cast gives it their all, and Paltrow handles her songs with aplomb, but Country Strong's cliched...",
      rotten: 101,
      production: "Screen Gems",
      lastUpdated: {
        $date: "2015-08-17T18:04:40.000Z",
      },
      fresh: 29,
    },
    poster:
      "http://m.media-amazon.com/images/M/MV5BMTUxMjQ0NjE3OV5BMl5BanBnXkFtZTcwODIxNDEwNA@@._V1_SY1000_SX67...",
    num_mflix_comments: 0,
    released: {
      $date: "2011-01-07T00:00:00.000Z",
    },
    awards: {
      wins: 2,
      nominations: 6,
      text: "Nominated for 1 Oscar. Another 1 win & 6 nominations.",
    },
    countries: ["USA"],
    cast: [
      "Gwyneth Paltrow",
      "Tim McGraw",
      "Garrett Hedlund",
      "...and 1 more items",
    ],
  },
];

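Best Practices

When generating MongoDB queries from natural language, apply the following best practices for specific use cases.

Include Index Information

Include the collection's indexes in your prompt to encourage the LLM to generate more performant queries. The MongoDB drivers and mongosh provide methods to retrieve index information. For example, the Node.js driver provides the listIndexes() method, which you can use to retrieve the indexes for your prompt.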
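
As an illustration, the following sketch turns index definitions into compact text for a prompt. The describeIndexes helper is hypothetical, and the example index definitions are invented for this sketch. The input is assumed to have the shape of the documents that the Node.js driver returns from collection.listIndexes().toArray(), with a name and a key per index.

```javascript
// Invented example input, shaped like the result of
// collection.listIndexes().toArray() in the Node.js driver.
const exampleIndexes = [
  { v: 2, key: { _id: 1 }, name: "_id_" },
  { v: 2, key: { year: 1, num_mflix_comments: -1 }, name: "year_1_num_mflix_comments_-1" },
  { v: 2, key: { title: 1 }, name: "title_1", unique: true },
];

// Hypothetical helper: one line per index, giving the name, the key pattern,
// and any notable options that are set to true.
function describeIndexes(indexes) {
  return indexes
    .map((index) => {
      const flags = ["unique", "sparse"].filter((flag) => index[flag] === true);
      const suffix = flags.length > 0 ? ` (${flags.join(", ")})` : "";
      return `${index.name}: ${JSON.stringify(index.key)}${suffix}`;
    })
    .join("\n");
}

console.log(describeIndexes(exampleIndexes));
```

A short line per index is usually enough for the model to see which fields and sort orders are indexed.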

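Time-Based Queries

Most LLM tools include the date in the system prompt. However, if you use an LLM directly, the model does not know the current date or time. Therefore, when you work with a base model or build your own natural language to MongoDB tool, include the latest date in your prompt. Use your programming language's methods to get the current date as a string, such as new Date().toString() in JavaScript or str(datetime.now()) in Python.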
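
For example, the following sketch builds a date line for the prompt. The latestDateLine helper is hypothetical. It uses toISOString() instead of toString(), a substitution made here so that the string is in UTC and does not depend on the machine's locale or time zone, and it follows the wording of the Latest Date field in the user message template on this page.

```javascript
// Hypothetical helper: formats the current date for inclusion in a prompt.
// Accepts an optional Date so the output can be fixed in tests.
function latestDateLine(now = new Date()) {
  // toISOString() always yields UTC in the form YYYY-MM-DDTHH:mm:ss.sssZ.
  return `Latest Date: ${now.toISOString()} (use this to inform dates in queries)`;
}

console.log(latestDateLine());
```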

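Annotated Database Schema

Include an annotated schema of the relevant database collections in your prompt. No single representation works best for every LLM, but some approaches are more effective than others.

We recommend representing collections with programming-language-native types that describe the shape of the data, such as TypeScript types, Python Pydantic models, or Go structs. If you use MongoDB from one of these languages, you may already have the data shape defined. To guide the LLM and reduce ambiguity, add a description of each field in the prompt.

The following example shows a TypeScript type for the sample_mflix.movies collection:

TypeScript Schema Example

Example annotated TypeScript schema for the sample_mflix.movies collection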

interface Movie {
  /**
   * Unique identifier for the movie document.
   */
  _id: ObjectId;
  /**
   * Brief description of the movie's plot.
   */
  plot: string;
  /**
   * List of genres associated with the movie.
   */
  genres: string[];
  /**
   * Duration of the movie in minutes.
   */
  runtime: number;
  /**
   * Title of the movie.
   */
  title: string;
  /**
   * Number of comments on the movie in the mflix system.
   */
  num_mflix_comments: number;
  /**
   * URL to the movie's poster image.
   */
  poster: string;
  /**
   * List of countries where the movie was produced.
   */
  countries: string[];
  /**
   * Detailed description of the movie's plot.
   */
  fullplot: string;
  /**
   * Languages spoken in the movie.
   */
  languages: string[];
  /**
   * Release date of the movie.
   */
  released: Date;
  /**
   * List of directors of the movie.
   */
  directors: string[];
  /**
   * List of writers of the movie.
   */
  writers: string[];
  /**
   * Awards received by the movie.
   */
  awards: {
    /**
     * Number of awards won by the movie.
     */
    wins: number;
    /**
     * Number of award nominations received by the movie.
     */
    nominations: number;
    /**
     * Textual description of the awards.
     */
    text: string;
  };
  /**
   * Last updated timestamp for the movie document.
   */
  lastupdated: string;
  /**
   * Year the movie was released.
   */
  year: number;
  /**
   * IMDb information for the movie.
   */
  imdb: {
    /**
     * IMDb rating of the movie.
     */
    rating: number;
    /**
     * Number of votes the movie received on IMDb.
     */
    votes: number;
    /**
     * IMDb identifier for the movie.
     */
    id: number;
  };
  /**
   * Type of the movie (e.g., movie, series).
   */
  type: string;
  /**
   * Rotten Tomatoes information for the movie.
   */
  tomatoes: {
    /**
     * Viewer ratings on Rotten Tomatoes.
     */
    viewer?: {
      /**
       * Viewer rating score.
       */
      rating: number;
      /**
       * Number of reviews by viewers.
       */
      numReviews: number;
      /**
       * Viewer meter score.
       */
      meter: number;
    };
    /**
     * DVD release date.
     */
    dvd?: Date;
    /**
     * Last updated timestamp for Rotten Tomatoes data.
     */
    lastUpdated?: Date;
    /**
     * Official website for the movie.
     */
    website?: string;
    /**
     * Critic ratings on Rotten Tomatoes.
     */
    critic?: {
      /**
       * Critic rating score.
       */
      rating: number;
      /**
       * Number of reviews by critics.
       */
      numReviews: number;
      /**
       * Critic meter score.
       */
      meter: number;
    };
    /**
     * Box office earnings.
     */
    boxOffice?: string;
    /**
     * Consensus statement from Rotten Tomatoes.
     */
    consensus?: string;
    /**
     * Number of rotten reviews.
     */
    rotten?: number;
    /**
     * Production company.
     */
    production?: string;
    /**
     * Number of fresh reviews.
     */
    fresh?: number;
  };
  /**
   * Hash value for the movie document.
   */
  hash: Long;
  /**
   * MPAA rating of the movie.
   */
  rated?: string;
  /**
   * Metacritic score of the movie.
   */
  metacritic?: number;
  /**
   * List of main cast members in the movie.
   */
  cast: string[];
}

Prompt Template

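The following example shows a complete prompt that uses the strategies described on this page to generate mongosh code from natural language.

Base Prompt Example

Use the following example system prompt as a template for your MongoDB query generation tasks. The sample prompt includes the following components:

Task overview and expected output format
General MongoDB query-authoring guidance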

You are an expert data analyst experienced at using MongoDB.
Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query.
Format the mongosh query in the following structure:
`db.<collection name>.find({/* query */})` or `db.<collection name>.aggregate({/* query */})`
Some general query-authoring tips:
1. Ensure proper use of MongoDB operators ($eq, $gt, $lt, etc.) and data types (ObjectId, ISODate).
2. For complex queries, use aggregation pipeline with proper stages ($match, $group, $lookup, etc.).
3. Consider performance by utilizing available indexes, avoiding $where and full collection scans, and using covered queries where possible.
4. Include sorting (.sort()) and limiting (.limit()) when appropriate for result set management.
5. Handle null values and existence checks explicitly with $exists and $type operators to differentiate between missing fields, null values, and empty arrays.
6. Do not include `null` in results objects in aggregation, e.g. do not include _id: null.
7. For date operations, NEVER use an empty new date object (e.g. `new Date()`). ALWAYS specify the date, such as `new Date("2024-10-24")`. Use the provided 'Latest Date' field to inform dates in queries.
8. For Decimal128 operations, prefer range queries over exact equality.
9. When querying arrays, use appropriate operators like $elemMatch for complex matching, $all to match multiple elements, or $size for array length checks.

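Note

You can also add a chain-of-thought prompt to encourage step-by-step thinking before the model generates code.

User Message Template

Then, use the following user message template to give the model the necessary context about the database and the desired query: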

Generate MongoDB Shell (mongosh) queries for the following database and natural language query:
## Database Information
Name: {{Database name}}
Description: {{database description}}
Latest Date: {{latest date}} (use this to inform dates in queries)
### Collections
#### Collection `{{collection name. Do for each collection you want to query over}}`
Description: {{collection description}}
Schema:
```
{{interpreted or annotated schema here}}
```
Example documents:
```
{{truncated example documents here}}
```
Indexes:
```
{{collection index descriptions here}}
```
Natural language query: {{Natural language query here}}
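
The template above can be filled in programmatically. The following is a minimal sketch, assuming that the schema, example documents, and index descriptions are already available as strings. The renderUserMessage and renderCollection helpers, their parameter names, and the example descriptions are hypothetical.

```javascript
// Markdown code fence used inside the prompt, built here so that this
// example does not contain a literal fence.
const FENCE = "`".repeat(3);

function fenced(text) {
  return `${FENCE}\n${text}\n${FENCE}`;
}

// Renders the section for one collection, following the template's layout.
function renderCollection(collection) {
  return [
    `#### Collection \`${collection.name}\``,
    `Description: ${collection.description}`,
    "Schema:",
    fenced(collection.schema),
    "Example documents:",
    fenced(collection.exampleDocuments),
    "Indexes:",
    fenced(collection.indexes),
  ].join("\n");
}

// Renders the full user message; collections is an array with one entry
// per collection that you want to query over.
function renderUserMessage({ databaseName, databaseDescription, latestDate, collections, naturalLanguageQuery }) {
  return [
    "Generate MongoDB Shell (mongosh) queries for the following database and natural language query:",
    "## Database Information",
    `Name: ${databaseName}`,
    `Description: ${databaseDescription}`,
    `Latest Date: ${latestDate} (use this to inform dates in queries)`,
    "### Collections",
    ...collections.map(renderCollection),
    `Natural language query: ${naturalLanguageQuery}`,
  ].join("\n");
}

// Example values are placeholders, not real schema or index output.
const userMessage = renderUserMessage({
  databaseName: "sample_mflix",
  databaseDescription: "Movies and user comments.",
  latestDate: "2024-10-24",
  collections: [{
    name: "movies",
    description: "One document per movie.",
    schema: "interface Movie { title: string; year: number; }",
    exampleDocuments: '[{ "title": "Country Strong", "year": 2010 }]',
    indexes: '_id_: {"_id":1}',
  }],
  naturalLanguageQuery: "Show me the genres and runtime of 10 movies from 2015 that have the most comments",
});
console.log(userMessage);
```

Send the rendered string as the user message, together with the system prompt, in your request to the model.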
