We sometimes need large EC2 instances to train our data science models. Because these machines are expensive, I don't want any of them running longer than necessary. Is it possible to create a separate ASG (node group) in EKS with zero initial instances, so that when a Job is launched with a node selector targeting this node group, an instance is launched, and then terminated once the Job completes and no remaining pods select this node group?
Check out the Cluster Autoscaler. It dynamically scales your EKS cluster based on the resource requests of your pods, and it lets you target specific types of nodes when scaling. It supports multiple AWS ASGs and scaling to and from zero. Whenever a pod cannot be scheduled because no existing node has enough free resources, the Cluster Autoscaler raises the desired capacity of a matching ASG so that new EC2 instances are launched. Nodes that stay unneeded for a while are removed again.
The linked guide should provide enough details to set it up.
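For the Job side of this setup, here is a minimal sketch in Python that builds a `batch/v1` Job manifest pinned to the dedicated node group through a `nodeSelector`. The label `workload-type: training`, the image name, and the resource requests are placeholders I made up for illustration, not values from your cluster. One thing to watch when scaling from zero: the autoscaler has no running node to inspect, so it has to learn the group's labels some other way. For a plain ASG this is typically done by tagging the ASG with `k8s.io/cluster-autoscaler/node-template/label/<key>`; please verify the details against the guide for your autoscaler version.

```python
import json


def build_training_job(name, image,
                       selector_key="workload-type",
                       selector_value="training"):
    """Return a batch/v1 Job manifest pinned to one node group.

    selector_key / selector_value are hypothetical; they must match the
    Kubernetes label carried by the nodes of the dedicated node group.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Do not retry a failed training run automatically.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    # Only nodes with this label are eligible, so the pod
                    # stays Pending until the node group scales up.
                    "nodeSelector": {selector_key: selector_value},
                    "containers": [
                        {
                            "name": "train",
                            "image": image,
                            # Placeholder requests; the autoscaler uses
                            # requests to decide whether a node would fit.
                            "resources": {
                                "requests": {"cpu": "8", "memory": "32Gi"}
                            },
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, for illustration only.
    job = build_training_job("train-model", "example.com/ml/train:latest")
    print(json.dumps(job, indent=2))
```

Since `kubectl` accepts JSON as well as YAML, the output can be piped straight in, e.g. `python job.py | kubectl apply -f -` (the file name is arbitrary). You may also want to taint the expensive nodes and add a matching toleration to the Job so that unrelated pods never land on them and keep them alive.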