Supported models

The chat API currently supports these models:

Llama 3.1 8B

  • Model id: llama3.1 or llama3.1:8b
  • Full model name: neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
  • Context length: 8192 tokens

Llama 3.1 70B

  • Model id: llama3.1:70b
  • Full model name: neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16
  • Context length: 128k (131072) tokens

Llama 3.1 405B

  • Model id: llama3.1:405b
  • Full model name: neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16
  • Context length: 128k (131072) tokens
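
The model ids and context lengths above can be captured in a small lookup table for client-side checks. The following is a minimal sketch, not part of the API: the names CHAT_MODELS, ALIASES, resolve_model and fits_context are hypothetical, and it assumes that the context length covers the prompt plus any generated tokens. Token counts are expected to come from the model's own tokenizer, which is not shown here.

```python
# Specs copied from the list above. The dict and helper names are
# hypothetical and exist only for this illustration.
CHAT_MODELS = {
    "llama3.1:8b": {
        "full_name": "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",
        "context_length": 8192,
    },
    "llama3.1:70b": {
        "full_name": "neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16",
        "context_length": 131072,
    },
    "llama3.1:405b": {
        "full_name": "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16",
        "context_length": 131072,
    },
}

# "llama3.1" is listed as an alternative id for the 8B model.
ALIASES = {"llama3.1": "llama3.1:8b"}


def resolve_model(model_id: str) -> dict:
    """Return the spec for a model id, following the alias if needed."""
    canonical = ALIASES.get(model_id, model_id)
    try:
        return CHAT_MODELS[canonical]
    except KeyError:
        raise ValueError(f"unsupported chat model: {model_id!r}") from None


def fits_context(model_id: str, prompt_tokens: int, max_new_tokens: int = 0) -> bool:
    """Check that prompt plus requested output stays within the context length.

    Assumption: the context length bounds prompt and generated tokens together.
    """
    limit = resolve_model(model_id)["context_length"]
    return prompt_tokens + max_new_tokens <= limit
```

For example, fits_context("llama3.1", 8000, 193) is False because 8193 tokens exceed the 8192-token limit of the 8B model, while the same request fits either of the larger models.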

Embedding models

The embeddings API currently supports these models:

gte-large-en-v1.5

  • Model id: gte-large-en-v1.5
  • Full model name: Alibaba-NLP/gte-large-en-v1.5
  • Max input token count: 8192
  • Output dimensions: 1024
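
The two limits above suggest two simple client-side checks: splitting long inputs into windows of at most 8192 tokens, and confirming that a returned vector has 1024 dimensions. The following is a minimal sketch under stated assumptions: split_into_windows and check_embedding are hypothetical helper names, the input is assumed to be already tokenized with the model's tokenizer, and how the API itself treats over-length input is not covered here.

```python
# Limits copied from the list above. The helper names are hypothetical.
EMBEDDING_MODEL = {
    "id": "gte-large-en-v1.5",
    "full_name": "Alibaba-NLP/gte-large-en-v1.5",
    "max_input_tokens": 8192,
    "output_dimensions": 1024,
}


def split_into_windows(token_ids, max_tokens=EMBEDDING_MODEL["max_input_tokens"]):
    """Split a token id list into consecutive windows of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [
        token_ids[start:start + max_tokens]
        for start in range(0, len(token_ids), max_tokens)
    ]


def check_embedding(vector, expected=EMBEDDING_MODEL["output_dimensions"]):
    """Raise ValueError if a vector does not have the documented dimension."""
    if len(vector) != expected:
        raise ValueError(
            f"expected {expected} dimensions, got {len(vector)}"
        )
```

Non-overlapping windows are the simplest choice for this illustration. Overlapping windows or sentence-aware splitting may preserve more context at window boundaries, at the cost of extra requests.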